Update: May 14, 2017

Since submitting this script back in January 2016, which mentioned the possibility of putting a Shiny GUI on it, I did indeed wrap it in Shiny as well as update the code, model and data. If you’re interested, go to the web-based Shinyapp “Best Colleges for You” – hosted at shinyapps.io by RStudio.
Have fun!

And let me know what you think at my LinkedIn page.

Introduction

The US College Scorecard dataset provides everyday folks, like myself, access to a rich source of US college information. Yet, it’s quite a challenge making sense out of such a sprawling dataset. And even moreso to make sense out of it in a way that captures the specific context of a student’s interest in determining which colleges are good candidates to attend or not.

Before jumping into the analysis, I’d like to acknowledge those folks who generously shared their scripts on Kaggle.com from which I learned – and outright copied – code snips to get this done. In particular, the code provided by Ben Hamner on using package ‘RSQLite’ and by Tad Dallas on using package ‘leaflet’ for map plotting. Thank you!!!

Comments:

ALSO: To see the payoff, go straight to the Sensitivity Analysis section.

My intent was to render the R script as a PDF-format notebook using RStudio’s File–>Compile Notebook… menu command. But it was just taking too long for me to get all the formatting of plots, lists and tables right – so I bailed!

The R script version runs fine when sourced at the R console or in RStudio. In fact, that’s the only way I could get the leaflet package’s interactive mapping to display – until I changed print(map) to just plain map.

Any help in cleaning up the formatting would be much appreciated.

Objective

I set out to build a decision analysis tool that would be useful to an individual, like me, for judging the suitability of colleges specifically attuned to my criteria. The analysis is based upon a model that combines the US College Scorecard data and my personal insights – i.e., domain expertise – into something useful for the following:

  1. Estimating what specific type student is attracted to each college;
  2. Estimating what specific types of student are most likely to thrive at each college; and
  3. Predicting whether a student with a specific profile of interests & character traits would thrive at a specific college or not.

This is still but a preliminary proof-of-concept level of analysis to illustrate the possibilities with this framework. Ideally, folks will see this, revise, complete and extend it.

Modeling Plan

My modeling plan looks like this:

  1. Explore the dataset with unsupervised learning and data visualization to understand some of the underlying structure of the data.
  1. Create archetypical behavioral characteristics of students and map those behavioral traits to variables in the Scorecard dataset.
  2. Build probabilistic choice models that estimate which colleges maximizes a specific student’s utility according to the students personalize valuation coefficients (i.e. partworths).
  3. Validate (with personal experience/anecdotal evidence) whether or not specific profiles predicted to thrive at specific colleges in each context make sense or not.

Data Pre-Processing & Exploration

I’ve done a ton of pre-processing, plotting and data transformations, even unsupervised learning, to explore the data. I’m going to skip the presentation of those exploration pieces here but instead just show the data winnowing and wrangling I did in preparation for modeling.

Let’s start by loading the packages, downloading & installing them if necessary:

library('RSQLite')   # for SQLite manipulation of database
library('magrittr')  # for piping with infix %>%, %<>%, %T>% (for cool, sophistication effect)
library('tidyr')     # for reshaping data (e.g., 'gather')
library('dplyr')     # for data wrangling (e.g., 'filter', 'select', 'summarize', 'inner_join')
library('ggplot2')   # for best-in-class visualization
library('gridExtra') # for grid layout of ggplots
#  library('poLCA')     # for estimation of latent classes underlying categorical variables
library('bnlearn')   # for learning BBN from datasets, and also provides function "discretize"
#  library('gRain')     # for Bayesian belief network (BBN) inference engine
library('leaflet')   # for mapping of the top schools for each student
library('htmltools') # for assistance with leaflet
library('RColorBrewer') # for color palette management

After orienting myself a little with the data documentation,…, I started by borrowing code from Ben Hamner’s “Exploring the US College Scorecard Data” script:

db <- dbConnect(dbDriver("SQLite"), "../input/database.sqlite")
# This stops SQLite from writing temp files to disk, which use all the available space
dbGetQuery(db, "PRAGMA temp_store=2;") 
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## data frame with 0 columns and 0 rows
tables <- dbGetQuery(db, "SELECT Name FROM sqlite_master WHERE type='table'")
colnames(tables) <- c("Name")
tables %<>%
  rowwise() %>%
  mutate(RowCount=dbGetQuery(db, paste0("SELECT COUNT(Id) RowCount FROM ", Name))$RowCount[1])
## Warning: package 'bindrcpp' was built under R version 3.4.1
show(tables)
## Source: local data frame [1 x 2]
## Groups: <by row>
## 
## # A tibble: 1 x 2
##        Name RowCount
##       <chr>    <int>
## 1 Scorecard   124699
allFields <- dbListFields(db,tables$Name[1])
show(length(allFields))
## [1] 1731

I also downloaded the data dictionary and relied upon it heavily to understand the database contents. Here I read it in and “revise” it by getting rid of the strange characters that were meant to be apostrophe’s and, more importantly, filled in the blank cells of its tabular format so it shows up like a standard data.frame useful for automated processing:

if(!exists("dataDict")) {
  if(!file.exists("CollegeScorecardDataDictionary-09-12-2015_REVISED.csv")) {
    # JUST ASSUMES THAT FILE "CollegeScorecardDataDictionary-09-12-2015.csv"
    # DOES EXIST IN THE WORKING DIRECTORY:
    dataDict  <- read.csv(file="../input/CollegeScorecardDataDictionary-09-12-2015.csv",
                          stringsAsFactors = FALSE)
    show(head(dataDict))
    str(dataDict)
    
    show(setdiff(allFields,dataDict$VARIABLE.NAME))
    show(setdiff(dataDict$VARIABLE.NAME,allFields))
    
    dataDict <- dataDict %>% mutate(LABEL=gsub("^(.*)’(.*)$","\\1'\\2",LABEL))
    # Fill-down
    fillDown <- function(df=dataDict,fillCols=c(1:5,8)) {
      nr   <- nrow(dataDict)
      irow <- 1
      while(irow < nr){
        while(df$NAME.OF.DATA.ELEMENT[irow] == ""){
          df[irow,fillCols] <- df[irow-1,fillCols]
          irow <- irow + 1
        }
        irow <- irow + 1
      }
      return(df)
    }
    dataDict <- dataDict %>% fillDown(c(1:5,8))
    #    write.csv(dataDict,file="CollegeScorecardDataDictionary-09-08-2015_REVISED.csv",
    #              row.names = FALSE,na="")
  } else {
    dataDict <- read.csv(file="../input/CollegeScorecardDataDictionary-09-12-2015_REVISED.csv",
                         stringsAsFactors = FALSE)
  }
}
##             NAME.OF.DATA.ELEMENT dev.category developer.friendly.name
## 1        Unit ID for institution         root                      id
## 2 8-digit OPE ID for institution         root                 ope8_id
## 3 6-digit OPE ID for institution         root                 ope6_id
## 4               Institution name       school                    name
## 5                           City       school                    city
## 6                 State postcode       school                   state
##   API.data.type VARIABLE.NAME VALUE LABEL SOURCE NOTES
## 1       integer        UNITID    NA        IPEDS      
## 2       integer         OPEID    NA        IPEDS      
## 3       integer        opeid6    NA        IPEDS      
## 4  autocomplete        INSTNM    NA        IPEDS      
## 5  autocomplete          CITY    NA        IPEDS      
## 6        string        STABBR    NA        IPEDS      
## 'data.frame':    1953 obs. of  9 variables:
##  $ NAME.OF.DATA.ELEMENT   : chr  "Unit ID for institution" "8-digit OPE ID for institution" "6-digit OPE ID for institution" "Institution name" ...
##  $ dev.category           : chr  "root" "root" "root" "school" ...
##  $ developer.friendly.name: chr  "id" "ope8_id" "ope6_id" "name" ...
##  $ API.data.type          : chr  "integer" "integer" "integer" "autocomplete" ...
##  $ VARIABLE.NAME          : chr  "UNITID" "OPEID" "opeid6" "INSTNM" ...
##  $ VALUE                  : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ LABEL                  : chr  "" "" "" "" ...
##  $ SOURCE                 : chr  "IPEDS" "IPEDS" "IPEDS" "IPEDS" ...
##  $ NOTES                  : chr  "" "" "" "" ...
##  [1] "Id"             "C200_4"         "C200_L4"        "D200_4"        
##  [5] "D200_L4"        "C200_4_POOLED"  "C200_L4_POOLED" "poolyrs200"    
##  [9] "D200_4_POOLED"  "D200_L4_POOLED" "Year"          
## [1] ""
print(head(dataDict))
##             NAME.OF.DATA.ELEMENT dev.category developer.friendly.name
## 1        Unit ID for institution         root                      id
## 2 8-digit OPE ID for institution         root                 ope8_id
## 3 6-digit OPE ID for institution         root                 ope6_id
## 4               Institution name       school                    name
## 5                           City       school                    city
## 6                 State postcode       school                   state
##   API.data.type VARIABLE.NAME VALUE LABEL SOURCE NOTES
## 1       integer        UNITID    NA        IPEDS      
## 2       integer         OPEID    NA        IPEDS      
## 3       integer        opeid6    NA        IPEDS      
## 4  autocomplete        INSTNM    NA        IPEDS      
## 5  autocomplete          CITY    NA        IPEDS      
## 6        string        STABBR    NA        IPEDS
print(dataDict %>% tbl_df)
## # A tibble: 1,953 x 9
##                          NAME.OF.DATA.ELEMENT dev.category
##                                         <chr>        <chr>
##  1                    Unit ID for institution         root
##  2             8-digit OPE ID for institution         root
##  3             6-digit OPE ID for institution         root
##  4                           Institution name       school
##  5                                       City       school
##  6                             State postcode       school
##  7                                   ZIP code       school
##  8                 Accreditor for institution       school
##  9             URL for institution's homepage       school
## 10 URL for institution's net price calculator       school
## # ... with 1,943 more rows, and 7 more variables:
## #   developer.friendly.name <chr>, API.data.type <chr>,
## #   VARIABLE.NAME <chr>, VALUE <int>, LABEL <chr>, SOURCE <chr>,
## #   NOTES <chr>

Probabilistic Discrete Choice Modeling

A key concept in the probabilistic Bayesian modeling framework I’ve chosen is to represent each college by a set of variables quantifying the types of students who populate their campus. This requires turning as many of the provided variables as possible into conditional probability distributions such as P(student-income > x | college = school-of-interest). From these conditional probabilities,

I then computed Bayes Factors: log10(P(student-income > x | college = school-of-interest)/P(student-income > x | college != school-of-interest)) = Posterior-Odds(college = school-of-interest | student-income>x)/Prior-Odds(college = school-of-interest).

These Bayes factors then serve as the covariates in the linear utility function of the discrete choice model: u(school-of-interest | student) = beta(student) * Bayes-factors(school-of-interest). (See Nobel Laureate for Economics Daniel McFadden’s contributions in discrete choice modeling.)

Then the student can just rank-order the colleges by their utilities u(school-of-interest | student).

Now, there are many alternatives to this framework. I chose this one because of its flexibility and my experience using such models. And even though I don’t do a rigorous data-estimated hierarchical model using tools such as Stan through package rstan, the promise of the framework is hinted at in the flexibility of the subsequent analyses.

Let’s get on with the work!

Grab the requisite datasets: Used years 2005 (for the Treasury data on students’ originating zip codes) and 2013.

# This groups the disciplines (degrees) into reasonably similar sets.
#discgrp <- read.csv("discgrp.csv", stringsAsFactors=FALSE)
discgrp <- data_frame(LABEL=c("Agriculture, Agriculture Operations, and Related Sciences","Natural Resources and Conservation",
                              "Architecture and Related Services","Area, Ethnic, Cultural, Gender, and Group Studies",
                              "Communication, Journalism, and Related Programs","Communications Technologies/Technicians and Support Services",
                              "Computer and Information Sciences and Support Services","Personal and Culinary Services",
                              "Education","Engineering","Engineering Technologies and Engineering-Related Fields",
                              "Foreign Languages, Literatures, and Linguistics","Family and Consumer Sciences/Human Sciences",
                              "Legal Professions and Studies","English Language and Literature/Letters",
                              "Liberal Arts and Sciences, General Studies and Humanities","Library Science","Biological and Biomedical Sciences",
                              "Mathematics and Statistics","Military Technologies and Applied Sciences","Multi/Interdisciplinary Studies",
                              "Parks, Recreation, Leisure, and Fitness Studies","Philosophy and Religious Studies","Theology and Religious Vocations",
                              "Physical Sciences","Science Technologies/Technicians","Psychology",
                              "Homeland Security, Law Enforcement, Firefighting and Related Protective Services",
                              "Public Administration and Social Service Professions","Social Sciences","Construction Trades",
                              "Mechanic and Repair Technologies/Technicians","Precision Production","Transportation and Materials Moving",
                              "Visual and Performing Arts","Health Professions and Related Programs",
                              "Business, Management, Marketing, and Related Support Services","History"),
                      discrp=c(2,4,2,5,6,4,1,4,5,1,1,5,5,7,5,5,5,2,1,4,5,4,5,5,2,3,3,4,3,3,4,4,4,4,6,2,7,5),
                      VARIABLE.NAME=c("PCIP01","PCIP03","PCIP04","PCIP05","PCIP09","PCIP10","PCIP11","PCIP12","PCIP13","PCIP14","PCIP15","PCIP16",
                                      "PCIP19","PCIP22","PCIP23","PCIP24","PCIP25","PCIP26","PCIP27","PCIP29","PCIP30","PCIP31","PCIP38","PCIP39","PCIP40",
                                      "PCIP41","PCIP42","PCIP43","PCIP44","PCIP45","PCIP46","PCIP47","PCIP48","PCIP49","PCIP50","PCIP51","PCIP52","PCIP54"))

Only deal with colleges having these traits:

  1. Public or Private non-profit
    • Drops Private for-profit schools
  2. Currently Operating
  3. Fully accreditted
  4. Predominantly Bachelor’s degree granting
  5. Within the 50 states of the U.S.A.
  6. Not a distance-only college
  7. Not a theological school

Now the challenge is to match the available database columns to the criteria I wish to calculate.

fieldNames <-
  c(
    'UGDS','UGDS_WHITE','UGDS_BLACK','UGDS_HISP','UGDS_ASIAN','UGDS_AIAN','UGDS_NHPI',
    'UGDS_2MOR','UGDS_NRA','UGDS_UNKN',"UG25abv","INC_PCT_LO","DEP_STAT_PCT_IND","DEP_INC_PCT_LO","IND_INC_PCT_LO",
    "DEP_INC_PCT_M1","DEP_INC_PCT_M2","DEP_INC_PCT_H1","IND_INC_PCT_M1","IND_INC_PCT_M2","IND_INC_PCT_H1","IND_INC_PCT_H2",
    "DEP_INC_PCT_H2","PAR_ED_PCT_1STGEN","pct_white","pct_black","pct_asian","pct_hispanic","pct_ba","pct_grad_prof",
    "pct_born_us","median_hh_inc","poverty_rate","unemp_rate","ln_median_hh_inc","pell_ever","female","fsend_1","fsend_2",
    "fsend_3","fsend_4","fsend_5","INC_PCT_LO","INC_PCT_M1","INC_PCT_M2","INC_PCT_H1","INC_PCT_H2","LATITUDE",
    "LONGITUDE","ZIP","STABBR","region","LOCALE","CCSIZSET","DISTANCEONLY","RELAFFIL","CURROPER","NPT4_PUB",
    "NPT4_PRIV","NPT41_PUB","NPT42_PUB","NPT43_PUB","NPT44_PUB","NPT45_PUB","NPT41_PRIV","NPT42_PRIV","NPT43_PRIV",
    "NPT44_PRIV","NPT45_PRIV","TUITIONFEE_IN","TUITIONFEE_OUT","PCTPELL","ADM_RATE","RET_FT4","SATVR25","SATVR75",
    "SATMT25","SATMT75","SATWR25","SATWR75","SATVRMID","SATMTMID","SATWRMID","ACTCM25","ACTCM75","ACTEN25","ACTEN75",
    "ACTMT25","ACTMT75","ACTWR25","ACTWR75","ACTCMMID","ACTENMID","ACTMTMID","ACTWRMID","SAT_AVG","SAT_AVG_ALL",
    "PCIP01","PCIP03","PCIP04","PCIP05","PCIP09","PCIP10","PCIP11","PCIP12",
    "PCIP13","PCIP14","PCIP15","PCIP16","PCIP19","PCIP22","PCIP23","PCIP24","PCIP25","PCIP26","PCIP27","PCIP29","PCIP30",
    "PCIP31","PCIP38","PCIP39","PCIP40","PCIP41","PCIP42","PCIP43","PCIP44","PCIP45","PCIP46","PCIP47","PCIP48","PCIP49",
    "PCIP50","PCIP51","PCIP52","PCIP54","C150_4_WHITE","C150_4_BLACK","C150_4_HISP","C150_4_ASIAN","ENRL_ORIG_YR2_RT",
    "LO_INC_ENRL_ORIG_YR2_RT","MD_INC_ENRL_ORIG_YR2_RT","HI_INC_ENRL_ORIG_YR2_RT","DEP_ENRL_ORIG_YR2_RT","IND_ENRL_ORIG_YR2_RT",
    "FEMALE_ENRL_ORIG_YR2_RT","MALE_ENRL_ORIG_YR2_RT","PELL_ENRL_ORIG_YR2_RT","NOPELL_ENRL_ORIG_YR2_RT","LOAN_ENRL_ORIG_YR2_RT",
    "NOLOAN_ENRL_ORIG_YR2_RT","FIRSTGEN_ENRL_ORIG_YR2_RT","NOT1STGEN_ENRL_ORIG_YR2_RT","ENRL_ORIG_YR3_RT","LO_INC_ENRL_ORIG_YR3_RT",
    "MD_INC_ENRL_ORIG_YR3_RT","HI_INC_ENRL_ORIG_YR3_RT","DEP_ENRL_ORIG_YR3_RT","IND_ENRL_ORIG_YR3_RT","FEMALE_ENRL_ORIG_YR3_RT",
    "MALE_ENRL_ORIG_YR3_RT","PELL_ENRL_ORIG_YR3_RT","NOPELL_ENRL_ORIG_YR3_RT","LOAN_ENRL_ORIG_YR3_RT","NOLOAN_ENRL_ORIG_YR3_RT",
    "FIRSTGEN_ENRL_ORIG_YR3_RT","NOT1STGEN_ENRL_ORIG_YR3_RT","ENRL_ORIG_YR4_RT","LO_INC_ENRL_ORIG_YR4_RT","MD_INC_ENRL_ORIG_YR4_RT",
    "HI_INC_ENRL_ORIG_YR4_RT","DEP_ENRL_ORIG_YR4_RT","IND_ENRL_ORIG_YR4_RT","FEMALE_ENRL_ORIG_YR4_RT","MALE_ENRL_ORIG_YR4_RT",
    "PELL_ENRL_ORIG_YR4_RT","NOPELL_ENRL_ORIG_YR4_RT","LOAN_ENRL_ORIG_YR4_RT","NOLOAN_ENRL_ORIG_YR4_RT","FIRSTGEN_ENRL_ORIG_YR4_RT",
    "NOT1STGEN_ENRL_ORIG_YR4_RT","ENRL_ORIG_YR6_RT","LO_INC_ENRL_ORIG_YR6_RT","MD_INC_ENRL_ORIG_YR6_RT","HI_INC_ENRL_ORIG_YR6_RT",
    "DEP_ENRL_ORIG_YR6_RT","IND_ENRL_ORIG_YR6_RT","FEMALE_ENRL_ORIG_YR6_RT","MALE_ENRL_ORIG_YR6_RT","PELL_ENRL_ORIG_YR6_RT",
    "NOPELL_ENRL_ORIG_YR6_RT","LOAN_ENRL_ORIG_YR6_RT","NOLOAN_ENRL_ORIG_YR6_RT","FIRSTGEN_ENRL_ORIG_YR6_RT","NOT1STGEN_ENRL_ORIG_YR6_RT",
    "ENRL_ORIG_YR8_RT","LO_INC_ENRL_ORIG_YR8_RT","MD_INC_ENRL_ORIG_YR8_RT","HI_INC_ENRL_ORIG_YR8_RT","DEP_ENRL_ORIG_YR8_RT",
    "IND_ENRL_ORIG_YR8_RT","FEMALE_ENRL_ORIG_YR8_RT","MALE_ENRL_ORIG_YR8_RT","PELL_ENRL_ORIG_YR8_RT","NOPELL_ENRL_ORIG_YR8_RT",
    "LOAN_ENRL_ORIG_YR8_RT","NOLOAN_ENRL_ORIG_YR8_RT","FIRSTGEN_ENRL_ORIG_YR8_RT","NOT1STGEN_ENRL_ORIG_YR8_RT","C150_4_POOLED_SUPP",
    "PCTFLOAN","DEBT_MDN","GRAD_DEBT_MDN","WDRAW_DEBT_MDN","LO_INC_DEBT_MDN","MD_INC_DEBT_MDN","HI_INC_DEBT_MDN","DEP_DEBT_MDN",
    "IND_DEBT_MDN","PELL_DEBT_MDN","NOPELL_DEBT_MDN","CUML_DEBT_N","CUML_DEBT_P90","CUML_DEBT_P75","CUML_DEBT_P25",
    "CUML_DEBT_P10","loan_ever","DEBT_MDN_SUPP","GRAD_DEBT_MDN_SUPP","GRAD_DEBT_MDN10YR_SUPP","CDR2","CDR3","COMPL_RPY_1YR_RT",
    "NONCOM_RPY_1YR_RT","LO_INC_RPY_1YR_RT","MD_INC_RPY_1YR_RT","HI_INC_RPY_1YR_RT","DEP_RPY_1YR_RT","IND_RPY_1YR_RT",
    "PELL_RPY_1YR_RT","NOPELL_RPY_1YR_RT","FEMALE_RPY_1YR_RT","MALE_RPY_1YR_RT","FIRSTGEN_RPY_1YR_RT","NOTFIRSTGEN_RPY_1YR_RT",
    "RPY_3YR_RT","COMPL_RPY_3YR_RT","NONCOM_RPY_3YR_RT","LO_INC_RPY_3YR_RT","MD_INC_RPY_3YR_RT","HI_INC_RPY_3YR_RT","DEP_RPY_3YR_RT",
    "IND_RPY_3YR_RT","PELL_RPY_3YR_RT","NOPELL_RPY_3YR_RT","FEMALE_RPY_3YR_RT","MALE_RPY_3YR_RT","FIRSTGEN_RPY_3YR_RT",
    "NOTFIRSTGEN_RPY_3YR_RT","RPY_5YR_RT","COMPL_RPY_5YR_RT","NONCOM_RPY_5YR_RT","LO_INC_RPY_5YR_RT","MD_INC_RPY_5YR_RT",
    "HI_INC_RPY_5YR_RT","DEP_RPY_5YR_RT","IND_RPY_5YR_RT","PELL_RPY_5YR_RT","NOPELL_RPY_5YR_RT","FEMALE_RPY_5YR_RT",
    "MALE_RPY_5YR_RT","FIRSTGEN_RPY_5YR_RT","NOTFIRSTGEN_RPY_5YR_RT","RPY_7YR_RT","COMPL_RPY_7YR_RT","NONCOM_RPY_7YR_RT",
    "LO_INC_RPY_7YR_RT","MD_INC_RPY_7YR_RT","HI_INC_RPY_7YR_RT","DEP_RPY_7YR_RT","IND_RPY_7YR_RT","PELL_RPY_7YR_RT",
    "NOPELL_RPY_7YR_RT","FEMALE_RPY_7YR_RT","MALE_RPY_7YR_RT","FIRSTGEN_RPY_7YR_RT","NOTFIRSTGEN_RPY_7YR_RT","count_ed",
    "count_nwne_p6","count_wne_p6","mn_earn_wne_p6","md_earn_wne_p6","pct10_earn_wne_p6","pct25_earn_wne_p6",
    "pct75_earn_wne_p6","pct90_earn_wne_p6","sd_earn_wne_p6","gt_25k_p6","mn_earn_wne_inc1_p6","mn_earn_wne_inc2_p6",
    "mn_earn_wne_inc3_p6"
  )
# This dropped name set is part of the base query string...
fieldNames <- setdiff(fieldNames,c('CITY','LONGITUDE','LATITUDE','ZIP','region','LOCALE','CURROPER')) 
fieldNames <- grep('PCIP',fieldNames,value = TRUE,invert = TRUE) # drop the discipline fields; they're grabbed separately.

# Upper case variables are from 2013 and lower case variables are from the Treasury dataset of 2005, except 'region'.
# fromS10 <- grep('ENRL_ORIG',fieldNames)
fromS11 <- which(fieldNames %in% grep('^[A-Z]|region',fieldNames))
fromS05 <- setdiff(seq_along(fieldNames),fromS11)
dfNames <- ifelse(grepl('^[A-Z]|region',fieldNames),fieldNames,paste(fieldNames,"2005",sep="_")) # put a '_2005' suffix on the Treasury variables.

makeQuery <- function(year,fieldNames,dfNames) {
  paste("SELECT UNITID unitID,
        INSTNM College,
        CONTROL CollegeType,
        PREDDEG degree,
        CURROPER currop,
        DISTANCEONLY distance,
        RELAFFIL relaffil,
        CITY city,
        LONGITUDE lon,
        LATITUDE  lat,
        ZIP zip,
        st_fips state,
        region region,
        LOCALE locale,
        CCBASIC ccbasic,
        CCUGPROF ccugprof,
        Year Year,",
        paste(fieldNames,' ',dfNames,sep="",collapse = ","),
        "FROM Scorecard 
        WHERE Year=",year)
}

queryString2005 <- makeQuery(2005,fieldNames,dfNames)
queryString2013 <- makeQuery(2013,fieldNames,dfNames)


discNames <- gsub('^([A-Z][-a-z]+)[, and/]*([A-Z][a-z]+).*','\\1\\2',discgrp$LABEL)
discgrp %<>% tbl_df %>% mutate(discName = discNames)
queryStringDscplns <- makeQuery(2013,discgrp$VARIABLE.NAME,discNames)

# Retrieve the data for 2005 because that is the latest with Treasury data on student families.
student2005 <- dbGetQuery(db,queryString2005)
## Warning in rsqlite_fetch(res@ptr, n = n): Column `zip`: mixed type, first
## seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_LO`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_STAT_PCT_IND`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M1`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M1`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PAR_ED_PCT_1STGEN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_white_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_black_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_asian_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_hispanic_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_ba_2005`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_grad_prof_2005`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct_born_us_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `median_hh_inc_2005`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `poverty_rate_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `unemp_rate_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ln_median_hh_inc_2005`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pell_ever_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `female_2005`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR2_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR2_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR2_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR2_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR3_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR3_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR3_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR3_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR4_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR4_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR4_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR4_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR6_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR6_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR6_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR6_RT`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `ENRL_ORIG_YR8_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LOAN_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOLOAN_ENRL_ORIG_YR8_RT`:
## mixed type, first seen values of type string, coercing other values of type
## real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `FIRSTGEN_ENRL_ORIG_YR8_RT`: mixed type, first seen values of type string,
## coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `NOT1STGEN_ENRL_ORIG_YR8_RT`: mixed type, first seen values of type string,
## coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `WDRAW_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_N`: mixed type,
## first seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P90`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P75`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P25`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P10`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `loan_ever_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN_SUPP`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN10YR_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `count_nwne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `count_wne_p6_2005`: mixed
## type, first seen values of type integer, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `mn_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `md_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct10_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct25_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct75_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `pct90_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `sd_earn_wne_p6_2005`:
## mixed type, first seen values of type integer, coercing other values of
## type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `gt_25k_p6_2005`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `mn_earn_wne_inc1_p6_2005`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `mn_earn_wne_inc2_p6_2005`: mixed type, first seen values of type real,
## coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column
## `mn_earn_wne_inc3_p6_2005`: mixed type, first seen values of type real,
## coercing other values of type string
# Retrieve the data for 2013 because that's the latest data.
student2013 <- dbGetQuery(db,queryString2013)
## Warning in rsqlite_fetch(res@ptr, n = n): Column `zip`: mixed type, first
## seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_LO`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_STAT_PCT_IND`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_LO`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_M2`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H1`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_INC_PCT_H2`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_INC_PCT_H2`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PAR_ED_PCT_1STGEN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_M2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H1`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `INC_PCT_H2`: mixed type,
## first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `C150_4_POOLED_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `WDRAW_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_DEBT_MDN`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_N`: mixed type,
## first seen values of type integer, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P90`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P75`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P25`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `CUML_DEBT_P10`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEBT_MDN_SUPP`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `GRAD_DEBT_MDN10YR_SUPP`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `RPY_3YR_RT`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `COMPL_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NONCOM_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_RPY_3YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_RPY_3YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FIRSTGEN_RPY_3YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOTFIRSTGEN_RPY_3YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `RPY_5YR_RT`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `COMPL_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NONCOM_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_RPY_5YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_RPY_5YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FIRSTGEN_RPY_5YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOTFIRSTGEN_RPY_5YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `RPY_7YR_RT`: mixed type,
## first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `COMPL_RPY_7YR_RT`: mixed
## type, first seen values of type string, coercing other values of type real
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NONCOM_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `LO_INC_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MD_INC_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `HI_INC_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `DEP_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `IND_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `PELL_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOPELL_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FEMALE_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `MALE_RPY_7YR_RT`: mixed
## type, first seen values of type real, coercing other values of type string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `FIRSTGEN_RPY_7YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
## Warning in rsqlite_fetch(res@ptr, n = n): Column `NOTFIRSTGEN_RPY_7YR_RT`:
## mixed type, first seen values of type real, coercing other values of type
## string
disciplines2013 <- dbGetQuery(db,queryStringDscplns)
## Warning in rsqlite_fetch(res@ptr, n = n): Column `zip`: mixed type, first
## seen values of type integer, coercing other values of type string
# Disconnect the database.
dbDisconnect(db)

#*** 2016/06/11: Discovered that Princeton University has "0" for "ComputerInformation"!
disciplines2013 %>% 
    filter( College == "Princeton University" ) %>% 
    select( 1:2, ComputerInformation, Engineering ) %>% 
    print
##   unitID              College ComputerInformation Engineering
## 1 186131 Princeton University                   0      0.1959
# Based on this web page (https://www.cs.princeton.edu/people/ugrad), I estimated ~450 computer science majors 
# out of the 5234 undergraduates enrolled at Princeton.# And since the Computer Science dept. is in the 
# Engineering school, I'll set "ComputerInformation" to 450/5234=8.6% and subtract that same amount from "Engineering".
csprop <- 0.086
engg   <-  disciplines2013[ disciplines2013$College == "Princeton University", "Engineering" ][[ 1 ]]
disciplines2013[ disciplines2013$College == "Princeton University", c( "ComputerInformation", "Engineering") ] <- c( csprop, engg - csprop )
disciplines2013 %>% 
    filter( College == "Princeton University" ) %>% 
    select( 1:2, ComputerInformation, Engineering ) %>% 
    print
##   unitID              College ComputerInformation Engineering
## 1 186131 Princeton University               0.086      0.1099
# Join the years together.
student2013 <- student2013[!sapply(student2013,function(x) all(is.na(x)))] %>% tbl_df
student2005 <- student2005[!sapply(student2005,function(x) all(is.na(x)))] %>% tbl_df
# student <- student2013 %>% tbl_df %>% 
#   dplyr::select(-contains('_2005')) %>% 
#   inner_join(student2005 %>% dplyr::select(unitID,contains('_2005')), by = "unitID")
student <- student2013 %>% 
  inner_join(student2005 %>% dplyr::select(unitID,contains('_2005')), by = "unitID")
# Add the dominant disciplines
discplns <- disciplines2013 %>% tbl_df %>%
  filter(unitID %in% student$unitID) %>%
  mutate(irow = seq_len(nrow(.))) %>%
  gather(key=Discipline,value=Proportion,one_of(discNames)) %>%
  group_by(irow) %>%
  arrange(desc(Proportion)) %>%
  summarize(
    unitID   = first(unitID),
    College  = first(College),
    domDisc1 = first(Discipline),
    pctDisc1 = 100*first(Proportion),
    domDisc2 = nth(Discipline,n=2),
    pctDisc2 = 100*nth(Proportion,n=2)
  ) %>%
  mutate(
    isSpecialty = pctDisc1 > 70,
    isTheological = grepl('Theolo|Relig',domDisc1) | grepl('Theolo|Relig',domDisc2)
  )

# CALLED THE MAIN DATA TABLE `student` BECAUSE ORIGINALLY ONLY CONTAINED
# `student` CATEGORY OF DATA FIELDS. NOW THIS LEGACY NAME AWKWARDLY CONFUSES
# THIS DATA TABLE OF UNIVERSITIES/COLLEGES WITH THE OTHER DATA OBJECTS
# CONTAINING SPECIFIC STUDENT PROPERTIES -- SEE LATER...

# Now add the disciplines as grouped
student %<>% 
  inner_join(discplns %>% dplyr::select(-College),by='unitID')

usa.states.dc <- c("District of Columbia",state.name)
student0 <- student

NOTE: At this point the foundational data table is student0, which is further manipulated as student. This includes filtering out colleges we’re not interested in. See the printouts below to get a sense of what data were kept.

# Filter down to just the colleges meeting the following criteria:
student  <- student0 %>% 
  filter(CollegeType != 'Private for-profit',
         currop      == 'Currently certified as operating',
         distance    == 'Not distance-education only',
         degree      == "Predominantly bachelor's-degree granting",
         as.character(region)      != 'U.S. Service Schools',
         !grepl('^Associate',ccbasic),
         state %in% usa.states.dc,
         !is.na(ccbasic),
         !is.na(UGDS),
         !is.na(pell_ever_2005),
         !isTheological,
         !is.na(mn_earn_wne_inc2_p6_2005))

# Turn character columns into factors.
student <- student %>% lapply(
  function(x) 
    if(is.character(x) | is.factor(x) | is.logical(x)) {
      if('PrivacySuppressed' %in% x) {
        tmp<-x;tmp[x=='PrivacySuppressed']<-NA;as.numeric(tmp)
      } else {
        factor(gsub("’","'",x))
      }
    } else as.numeric(x)) %>% 
  as.data.frame %>% tbl_df

# add a religious affiliation indicator variable:
student %<>% 
  mutate(relaffilFLAG=factor(ifelse(is.na(relaffil),
                                    FALSE,
                                    !grepl('^[Nn]ot ',as.character(relaffil)))))


a<- student %>% dplyr::select(grep('^(SAT|ACT)',names(.))) %>%  apply(1,function(x) !all(is.na(x)))
student %<>% mutate(stdzdtest=a)

show(dim(student))
## [1] 1514  198
#show(summary(student))

Drop all the Penn St. and U. of Pitt. schools except the main campus because same 2005 Treasury data entered in all rows for each, resp.

stdttmp <- student %>% filter((College %in% c('Pennsylvania State University-Main Campus',
                                              'University of Pittsburgh-Pittsburgh Campus')) |
                                !(College %in% grep('Pennsylvania State University|University of Pittsburgh',
                                                    College,value=TRUE)))
show(setdiff(student$College,stdttmp$College))
##  [1] "Pennsylvania State University-Penn State Erie-Behrend College"
##  [2] "Pennsylvania State University-Penn State New Kensington"      
##  [3] "Pennsylvania State University-Penn State Shenango"            
##  [4] "Pennsylvania State University-Penn State Wilkes-Barre"        
##  [5] "Pennsylvania State University-Penn State Worthington Scranton"
##  [6] "Pennsylvania State University-Penn State Lehigh Valley"       
##  [7] "Pennsylvania State University-Penn State Altoona"             
##  [8] "Pennsylvania State University-Penn State Beaver"              
##  [9] "Pennsylvania State University-Penn State Berks"               
## [10] "Pennsylvania State University-Penn State Harrisburg"          
## [11] "Pennsylvania State University-Penn State Brandywine"          
## [12] "Pennsylvania State University-Penn State Greater Allegheny"   
## [13] "Pennsylvania State University-Penn State Abington"            
## [14] "Pennsylvania State University-Penn State Schuylkill"          
## [15] "Pennsylvania State University-Penn State York"                
## [16] "University of Pittsburgh-Bradford"                            
## [17] "University of Pittsburgh-Greensburg"                          
## [18] "University of Pittsburgh-Johnstown"
show(grep('Pennsylvania State University|University of Pittsburgh',stdttmp$College,value=TRUE))
## [1] "Pennsylvania State University-Main Campus" 
## [2] "University of Pittsburgh-Pittsburgh Campus"
student <- stdttmp
rm(stdttmp)
student <- student[!sapply(student,function(x) length(which(is.na(x)))>1000)]
# Drop schools without earnings data:
student %>% filter(is.na(pct10_earn_wne_p6_2005) | pct10_earn_wne_p6_2005==0) %>% 
  dplyr::select(unitID,College,UGDS,matches('pct.+earn')) %>% 
  print(n=60)
## # A tibble: 58 x 7
##    unitID                                                      College
##     <dbl>                                                       <fctr>
##  1 102395                                 United States Sports Academy
##  2 106306                                     Arkansas Baptist College
##  3 110316                     California Institute of Integral Studies
##  4 112570                                   Columbia College-Hollywood
##  5 116712                                    John F Kennedy University
##  6 116846                                   American Jewish University
##  7 117168                             Laguna College of Art and Design
##  8 119058                  Monterey Institute of International Studies
##  9 120768                                         Pacific Oaks College
## 10 120838                                    Pacific States University
## 11 122506                          San Francisco Conservatory of Music
## 12 123633                                       South Baylo University
## 13 123952                Southern California Institute of Architecture
## 14 124292                                       Thomas Aquinas College
## 15 143297                           Blessing Rieman College of Nursing
## 16 146533                                  Lakeview College of Nursing
## 17 147590                       National University of Health Sciences
## 18 148575              Saint Francis Medical Center College of Nursing
## 19 148593                                 St John's College of Nursing
## 20 148849                                               Shimer College
## 21 149028                             Saint Anthony College of Nursing
## 22 149639                                  VanderCook College of Music
## 23 153861                           Maharishi University of Management
## 24 160409                                Saint Joseph Seminary College
## 25 160959                                      College of the Atlantic
## 26 164872                                 Boston Architectural College
## 27 164933                                      The Boston Conservatory
## 28 166045                                               Hebrew College
## 29 166869                          MGH Institute of Health Professions
## 30 177038                             Cleveland University-Kansas City
## 31 179450                      Saint Luke's College of Health Sciences
## 32 180878                             Bryan College of Health Sciences
## 33 183275                          Thomas More College of Liberal Arts
## 34 187745       Institute of American Indian and Alaska Native Culture
## 35 190576                   CUNY Graduate School and University Center
## 36 192712                                    Manhattan School of Music
## 37 197221                                               Webb Institute
## 38 198677                                       Heritage Bible College
## 39 201061                                    Art Academy of Cincinnati
## 40 202073                                 Cleveland Institute of Music
## 41 205027                                Pontifical College Josephinum
## 42 207856                            Southwestern Christian University
## 43 209533                              Oregon College of Art and Craft
## 44 211893                                    Curtis Institute of Music
## 45 214971                        Pennsylvania Academy of the Fine Arts
## 46 215053                       Pennsylvania College of Art and Design
## 47 220808                                       Memphis College of Art
## 48 221254                                     O'More College of Design
## 49 223214                 Texas A & M University Health Science Center
## 50 231095                                             Sterling College
## 51 231651                                            Regent University
## 52 233611                                 Southern Virginia University
## 53 238324                                               Bellin College
## 54 243823                                            Parker University
## 55 373827                         New England School of Communications
## 56 385619                                        Everglades University
## 57 416801            The University of Texas MD Anderson Cancer Center
## 58 435000 Louisiana State University Health Sciences Center-Shreveport
## # ... with 5 more variables: UGDS <dbl>, pct10_earn_wne_p6_2005 <dbl>,
## #   pct25_earn_wne_p6_2005 <dbl>, pct75_earn_wne_p6_2005 <dbl>,
## #   pct90_earn_wne_p6_2005 <dbl>
student %<>% 
  filter(!is.na(pct10_earn_wne_p6_2005) & pct10_earn_wne_p6_2005>0) 

#show(summary(student))

Most selective schools according to admissions rate:

# Most selective schools according to admissions rate:
# student %>% 
#   filter(!is.na(ADM_RATE) & ADM_RATE<0.2 & UGDS>600) %>% 
#   dplyr::select(unitID,College,UGDS,matches('SAT|ACT'),ADM_RATE) %>% 
#   arrange(ADM_RATE) %>% 
#   print(n=40)
student %>% 
  filter(!is.na(ADM_RATE) & ADM_RATE<0.2 & UGDS>600) %>% 
  dplyr::select(unitID,College,UGDS,contains('Disc'),ADM_RATE,pell_ever_2005) %>% 
  arrange(ADM_RATE) %>% 
  print(n=40)
## # A tibble: 37 x 9
##    unitID                                             College  UGDS
##     <dbl>                                              <fctr> <dbl>
##  1 243744                                 Stanford University  6980
##  2 166027                                  Harvard University  7278
##  3 130794                                     Yale University  5422
##  4 186131                                Princeton University  5234
##  5 190150         Columbia University in the City of New York  7970
##  6 190372 Cooper Union for the Advancement of Science and Art   851
##  7 166683               Massachusetts Institute of Technology  4510
##  8 144050                               University of Chicago  5697
##  9 217156                                    Brown University  6182
## 10 182670                                   Dartmouth College  4154
## 11 110404                  California Institute of Technology   977
## 12 112260                           Claremont McKenna College  1311
## 13 215062                          University of Pennsylvania 10606
## 14 221999                               Vanderbilt University  6794
## 15 178697                               College of the Ozarks  1513
## 16 133872             Adventist University of Health Sciences  2064
## 17 198419                                     Duke University  6501
## 18 176318                                        Rust College   922
## 19 121345                                      Pomona College  1587
## 20 164465                                     Amherst College  1785
## 21 216287                                  Swarthmore College  1524
## 22 121257                                      Pitzer College  1081
## 23 161004                                     Bowdoin College  1789
## 24 147767                             Northwestern University  8855
## 25 179867                   Washington University in St Louis  6851
## 26 190415                                  Cornell University 14309
## 27 227757                                     Rice University  3920
## 28 131496                               Georgetown University  7261
## 29 168342                                    Williams College  2051
## 30 230959                                  Middlebury College  2466
## 31 110635                   University of California-Berkeley 25951
## 32 162928                            Johns Hopkins University  5932
## 33 115409                                 Harvey Mudd College   803
## 34 234207                       Washington and Lee University  1846
## 35 168148                                    Tufts University  5146
## 36 123961                   University of Southern California 18087
## 37 138716                             Albany State University  3566
## # ... with 6 more variables: domDisc1 <fctr>, pctDisc1 <dbl>,
## #   domDisc2 <fctr>, pctDisc2 <dbl>, ADM_RATE <dbl>, pell_ever_2005 <dbl>
#===================================================

Building Conditional Probabilities & Bayes Factors

What follows are code chunks to generate specific contextual data tables of conditional probabilities and Bayes factors for different aspects of the college experience for a student at each college, as was described above. Note that I fudged the Bayes factors – even after defining the functions to compute them properly. Also, I did a bit of exploration with poLCA, but commented it out. (Using ggplot2, it’s easy to visually explore clustering solutions using ‘small multiples’ (i.e., facet_wrap). But not shown here….)

Ethnicity:

ethnicity <- student %>% 
  dplyr::select(unitID,College,contains("UGDS")) %>%
  mutate(probSchool = UGDS/sum(UGDS),
         totprob    = rowSums(as.matrix(.[4:ncol(.)])))
ethnicity %>% print(n=20)
## # A tibble: 1,438 x 14
##    unitID                             College  UGDS UGDS_WHITE UGDS_BLACK
##     <dbl>                              <fctr> <dbl>      <dbl>      <dbl>
##  1 100654            Alabama A & M University  4051     0.0279     0.9501
##  2 100663 University of Alabama at Birmingham 11200     0.5987     0.2590
##  3 100706 University of Alabama in Huntsville  5525     0.7012     0.1310
##  4 100724            Alabama State University  5354     0.0161     0.9285
##  5 100751           The University of Alabama 28692     0.7865     0.1140
##  6 100812             Athens State University  2999     0.7513     0.1064
##  7 100830     Auburn University at Montgomery  4322     0.5532     0.3031
##  8 100858                   Auburn University 19761     0.8543     0.0714
##  9 100937         Birmingham Southern College  1181     0.8103     0.0940
## 10 101073           Concordia College Alabama   523     0.0115     0.9388
## 11 101189                 Faulkner University  2358     0.3919     0.5254
## 12 101435                  Huntingdon College  1100     0.5845     0.1918
## 13 101480       Jacksonville State University  7195     0.6655     0.2628
## 14 101541                      Judson College   331     0.7915     0.1450
## 15 101587          University of West Alabama  1988     0.4291     0.4195
## 16 101675                       Miles College  1663     0.0180     0.9693
## 17 101693                University of Mobile  1472     0.6461     0.2160
## 18 101709            University of Montevallo  2610     0.7368     0.1398
## 19 101879         University of North Alabama  5536     0.7150     0.1384
## 20 101912                  Oakwood University  1854     0.0102     0.8495
## # ... with 1,418 more rows, and 9 more variables: UGDS_HISP <dbl>,
## #   UGDS_ASIAN <dbl>, UGDS_AIAN <dbl>, UGDS_NHPI <dbl>, UGDS_2MOR <dbl>,
## #   UGDS_NRA <dbl>, UGDS_UNKN <dbl>, probSchool <dbl>, totprob <dbl>

This computes P(E | not(H)) the conditional probability of evidence given the hypothesis is FALSE using inputs of cpH = P(E | H) probability of evidence given the hypothesis is TRUE and pH = P(H) the marginal (prior) probability of hypothesis being TRUE. But, in this case of almost 1500 schools, where “H” is school si, P(E|not(H)) is approximately P(E) = sum(P(E|H)*P(H);over H). So, could skip this function and just use P(E).

IMPORTANT: I use log10 of the Bayes factors, which is a bit unusual, but I know powers of 10 better than powers of 2 or e.

cpNotH <- function(cpH,pH){
  pEH <- cpH * pH
  pE  <- sum(pEH)
  return((pE - pEH)/(1 - pH))
}
BFactor <- function(cpH,cpNotH,logFunc=log10){
  x <- cpH / cpNotH
  return(if(is.function(logFunc)) logFunc(x) else x)
}
makeCpNotH <- function(x) mapply(cpNotH,cpH=x,pH=ethnicity$probSchool)
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*ethnicity$probSchool,na.rm=TRUE))} # Approximate

ethnicBF <- ethnicity %>% 
  dplyr::select(contains("UGDS_")) %>% 
  mutate_each(funs(makeBF)) %>% 
  setNames(gsub("UGDS","BF",names(.))) %$%
  bind_cols(ethnicity[1:2],.) %>%
  mutate(BF_female_2005 = makeBF(student$female_2005))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
#=====================================================

Income:

income <- student %>% 
  dplyr::select(unitID,College,UGDS,DEP_STAT_PCT_IND,contains("INC_PCT")) %>%
  mutate(probSchool = UGDS/sum(UGDS),
         totprob      = rowSums(as.matrix(.[c("INC_PCT_LO","INC_PCT_M1","INC_PCT_M2","INC_PCT_H1",
                                              "INC_PCT_H2")])),
         totprobdep   = rowSums(as.matrix(.[c("DEP_INC_PCT_LO","DEP_INC_PCT_M1","DEP_INC_PCT_M2",
                                              "DEP_INC_PCT_H1","DEP_INC_PCT_H2")])),
         totprobind   = rowSums(as.matrix(.[c("IND_INC_PCT_LO","IND_INC_PCT_M1","IND_INC_PCT_M2",
                                              "IND_INC_PCT_H1","IND_INC_PCT_H2")])),
         totprob2dep  = rowSums(as.matrix(.[c("DEP_INC_PCT_LO","DEP_INC_PCT_H2")])),
         totprob2ind  = rowSums(as.matrix(.[c("IND_INC_PCT_LO","IND_INC_PCT_H2")])))
income %>% print(n=20)
## # A tibble: 1,438 x 25
##    unitID                             College  UGDS DEP_STAT_PCT_IND
##     <dbl>                              <fctr> <dbl>            <dbl>
##  1 100654            Alabama A & M University  4051       0.12474184
##  2 100663 University of Alabama at Birmingham 11200       0.29927474
##  3 100706 University of Alabama in Huntsville  5525       0.37228542
##  4 100724            Alabama State University  5354       0.15468227
##  5 100751           The University of Alabama 28692       0.16031165
##  6 100812             Athens State University  2999       0.73529412
##  7 100830     Auburn University at Montgomery  4322       0.29577465
##  8 100858                   Auburn University 19761       0.08752683
##  9 100937         Birmingham Southern College  1181       0.02843602
## 10 101073           Concordia College Alabama   523       0.16939891
## 11 101189                 Faulkner University  2358       0.58565955
## 12 101435                  Huntingdon College  1100       0.25824176
## 13 101480       Jacksonville State University  7195       0.27142857
## 14 101541                      Judson College   331       0.27363184
## 15 101587          University of West Alabama  1988       0.25444265
## 16 101675                       Miles College  1663       0.15498652
## 17 101693                University of Mobile  1472       0.36000000
## 18 101709            University of Montevallo  2610       0.18958155
## 19 101879         University of North Alabama  5536       0.20905350
## 20 101912                  Oakwood University  1854       0.21806167
## # ... with 1,418 more rows, and 21 more variables: INC_PCT_LO <dbl>,
## #   DEP_INC_PCT_LO <dbl>, IND_INC_PCT_LO <dbl>, DEP_INC_PCT_M1 <dbl>,
## #   DEP_INC_PCT_M2 <dbl>, DEP_INC_PCT_H1 <dbl>, IND_INC_PCT_M1 <dbl>,
## #   IND_INC_PCT_M2 <dbl>, IND_INC_PCT_H1 <dbl>, IND_INC_PCT_H2 <dbl>,
## #   DEP_INC_PCT_H2 <dbl>, INC_PCT_M1 <dbl>, INC_PCT_M2 <dbl>,
## #   INC_PCT_H1 <dbl>, INC_PCT_H2 <dbl>, probSchool <dbl>, totprob <dbl>,
## #   totprobdep <dbl>, totprobind <dbl>, totprob2dep <dbl>,
## #   totprob2ind <dbl>
# To avoid dropping many schools due to missing values will hold 7 variables
# for income: DEP_STAT_PCT_IND, DEP_INC_PCT_LO, DEP_INC_PCT_H2 &
# DEP_INC_GTLO_LTH2, IND_INC_PCT_LO, IND_INC_PCT_H2 & IND_INC_GTLO_LTH2; where
# the x_INC_GTLO_LTH2 variables are 1 - x_INC_PCT_H2 - x_INC_PCT_LO; and the
# sum of the latter 6 columns = 1. So must mutate the current IND_... columns
# by multiplying by DEP_STAT_PCT_IND. An "x" is appended to the end of the
# newly computed variable names.

income %<>% mutate(probSchool = UGDS/sum(UGDS),
                   DEP_STAT_PCT_INDx  = DEP_STAT_PCT_IND,
                   DEP_STAT_PCT_DEPx  = 1 - DEP_STAT_PCT_INDx,
                   DEP_INC_PCT_LOx    = DEP_INC_PCT_LO*DEP_STAT_PCT_DEPx,
                   DEP_INC_PCT_HI2x   = DEP_INC_PCT_H2*DEP_STAT_PCT_DEPx,
                   DEP_INC_GTLO_LTH2x = DEP_STAT_PCT_DEPx - DEP_INC_PCT_LOx - DEP_INC_PCT_HI2x,
                   IND_INC_PCT_LOx    = IND_INC_PCT_LO*DEP_STAT_PCT_INDx,
                   IND_INC_PCT_HI2x   = IND_INC_PCT_H2*DEP_STAT_PCT_INDx,
                   IND_INC_GTLO_LTH2x = DEP_STAT_PCT_IND - IND_INC_PCT_LOx - IND_INC_PCT_HI2x,
                   totprobx = DEP_INC_PCT_LOx + DEP_INC_PCT_HI2x + DEP_INC_GTLO_LTH2x + IND_INC_PCT_LOx + IND_INC_PCT_HI2x + IND_INC_GTLO_LTH2x) %>%
  dplyr::select(unitID,College,UGDS,probSchool,matches("x$"))
income %>% print(n=20)
## # A tibble: 1,438 x 13
##    unitID                             College  UGDS   probSchool
##     <dbl>                              <fctr> <dbl>        <dbl>
##  1 100654            Alabama A & M University  4051 5.165870e-04
##  2 100663 University of Alabama at Birmingham 11200 1.428234e-03
##  3 100706 University of Alabama in Huntsville  5525 7.045528e-04
##  4 100724            Alabama State University  5354 6.827467e-04
##  5 100751           The University of Alabama 28692 3.658829e-03
##  6 100812             Athens State University  2999 3.824351e-04
##  7 100830     Auburn University at Montgomery  4322 5.511452e-04
##  8 100858                   Auburn University 19761 2.519940e-03
##  9 100937         Birmingham Southern College  1181 1.506021e-04
## 10 101073           Concordia College Alabama   523 6.669341e-05
## 11 101189                 Faulkner University  2358 3.006942e-04
## 12 101435                  Huntingdon College  1100 1.402730e-04
## 13 101480       Jacksonville State University  7195 9.175126e-04
## 14 101541                      Judson College   331 4.220941e-05
## 15 101587          University of West Alabama  1988 2.535115e-04
## 16 101675                       Miles College  1663 2.120672e-04
## 17 101693                University of Mobile  1472 1.877107e-04
## 18 101709            University of Montevallo  2610 3.328295e-04
## 19 101879         University of North Alabama  5536 7.059555e-04
## 20 101912                  Oakwood University  1854 2.364237e-04
## # ... with 1,418 more rows, and 9 more variables: DEP_STAT_PCT_INDx <dbl>,
## #   DEP_STAT_PCT_DEPx <dbl>, DEP_INC_PCT_LOx <dbl>,
## #   DEP_INC_PCT_HI2x <dbl>, DEP_INC_GTLO_LTH2x <dbl>,
## #   IND_INC_PCT_LOx <dbl>, IND_INC_PCT_HI2x <dbl>,
## #   IND_INC_GTLO_LTH2x <dbl>, totprobx <dbl>
#income[income$College=='California Institute of Technology',
#        c('IND_INC_PCT_LOx','IND_INC_PCT_HI2x','IND_INC_GTLO_LTH2x','totprobx')] <- c(0,0,0,1)

makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*income$probSchool,na.rm=TRUE))} # Approximate
incomeBF <- income %>% 
  dplyr::select(matches("x$"),-totprobx) %>% 
  mutate_each(funs(makeBF)) %>% 
  setNames(gsub("(.+)x$","BF_\\1x",names(.))) %$%
  bind_cols(income[1:4],.)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
incomeBFdiscrete <- incomeBF %>% dplyr::select(5:ncol(incomeBF)) %>% filter(complete.cases(.)) %>%
  lapply(function(x) discretize(data.frame(x[ifelse(is.na(x),FALSE,x>-5)]),method='quantile',breaks=5))
incomeBFlowestState <- incomeBF %>% dplyr::select(5:ncol(incomeBF)) %>% filter(complete.cases(.)) %>%
  lapply(function(x) ifelse(x<= -5,"(-Inf,-5]",'FALSE'))
incomeBFd <- mapply(function(x,y) {
  tmp <- x
  tmp[tmp=='FALSE'] <- as.character(y[[1]])
  return(factor(tmp,levels=c("(-Inf,-5]",levels(y[[1]]))))},
  incomeBFlowestState,incomeBFdiscrete,SIMPLIFY = FALSE)
names(incomeBFd) <- names(incomeBFdiscrete)
incomeBFd %<>% data.frame %>% tbl_df
rm(list=c('incomeBFdiscrete','incomeBFlowestState'))

incomeBFd <-  incomeBF %>% 
  filter(complete.cases(incomeBF %>% dplyr::select(5:ncol(incomeBF)))) %>% 
  dplyr::select(1:4) %>% 
  bind_cols(incomeBFd)

Here, I did some cluster analysis using package poLCA but ultimately set it aside to be dug into later…

# #-------------------------------------
# bmod <- poLCA(as.formula(paste0('cbind(',paste(names(incomeBFd)[-(1:4)],collapse=','),') ~ 1')),
#               data=incomeBFd,nclass = 9,nrep=13,na.rm=FALSE)
# new.probs.start <- poLCA.reorder(bmod$probs.start,order(bmod$P,decreasing=TRUE))
# bmod <- poLCA(as.formula(paste0('cbind(',paste(names(incomeBFd)[-(1:4)],collapse=','),') ~ 1')),
#               data=incomeBFd,nclass = 9,probs.start=new.probs.start,na.rm=FALSE)
# 
# # Compute the mutual information shared by each variable with the class labels:
# miBFclass <- incomeBFd %>% 
#   mutate(class=bmod$predclass) %$% 
#   sapply(grep('^BF',names(.),value=TRUE),mInfo,ynm='class',data=.) %>% 
#   sort(decreasing=TRUE) %T>%
#   print
# incomeBFd %>% 
#   mutate(class=bmod$predclass) %>% 
#   inner_join(incomeBF,by='unitID') %>% 
#   group_by(class) %>% 
#   summarize_each(funs(median),matches('x.y$')) %>%
#   dplyr::select(one_of(names(miBFclass))) %>% 
#   print(width=Inf)
# 
# incomeBFd %>% 
#   dplyr::select(-c(1,3:4)) %>% 
#   mutate(class=bmod$predclass) %>% 
#   filter(class==9) %>% 
#   print(n=20,width=Inf)
# 
# incomeBFd %>% 
#   mutate(class=bmod$predclass) %>% 
#   filter(grepl("Cali.+Insti.+Tech|Harvard|Mass.+Inst.+Tech|Northwestern Univ|Princeton|Cornell U",
#                College)) %>%
#   dplyr::select(one_of(setdiff(names(.),names(miBFclass))),one_of(names(miBFclass))) %>% 
#   arrange(class) %>%
#   print(width=Inf)
# #---------------------------------------
#===================================================

Student Aid:

aid <- student %>% 
  dplyr::select(unitID,College,UGDS,matches("pell_ever|fsend")) %>%
  mutate(probSchool = UGDS/sum(UGDS),
         totprob    = rowSums(as.matrix(.[grepl('fsend',names(.))])))
aid %>% print(n=20)
## # A tibble: 1,438 x 11
##    unitID                             College  UGDS pell_ever_2005
##     <dbl>                              <fctr> <dbl>          <dbl>
##  1 100654            Alabama A & M University  4051           0.81
##  2 100663 University of Alabama at Birmingham 11200           0.59
##  3 100706 University of Alabama in Huntsville  5525           0.60
##  4 100724            Alabama State University  5354           0.85
##  5 100751           The University of Alabama 28692           0.53
##  6 100812             Athens State University  2999           0.68
##  7 100830     Auburn University at Montgomery  4322           0.67
##  8 100858                   Auburn University 19761           0.46
##  9 100937         Birmingham Southern College  1181           0.35
## 10 101073           Concordia College Alabama   523           1.00
## 11 101189                 Faulkner University  2358           0.70
## 12 101435                  Huntingdon College  1100           0.53
## 13 101480       Jacksonville State University  7195           0.70
## 14 101541                      Judson College   331           0.62
## 15 101587          University of West Alabama  1988           0.76
## 16 101675                       Miles College  1663           0.89
## 17 101693                University of Mobile  1472           0.62
## 18 101709            University of Montevallo  2610           0.61
## 19 101879         University of North Alabama  5536           0.65
## 20 101912                  Oakwood University  1854           0.69
## # ... with 1,418 more rows, and 7 more variables: fsend_1_2005 <dbl>,
## #   fsend_2_2005 <dbl>, fsend_3_2005 <dbl>, fsend_4_2005 <dbl>,
## #   fsend_5_2005 <dbl>, probSchool <dbl>, totprob <dbl>
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*aid$probSchool,na.rm=TRUE))} # Approximate
aidBF <- aid %>% 
  dplyr::select(matches("pell_ever|fsend")) %>% 
  mutate_each(funs(makeBF)) %>% 
  setNames(paste0("BF_",names(.))) %$%
  bind_cols(aid[1:3],.)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
g <- aidBF %>% gather(key=Variable,value=BF,-(1:3)) %>%
  ggplot(aes(x=BF,fill=Variable))
g + geom_density(alpha=0.3) + 
  facet_wrap(~Variable) + 
  scale_x_continuous(lim=c(-1,1)) + 
  theme(text=element_text(size=14,face = 'bold'))
## Warning: Removed 2 rows containing non-finite values (stat_density).

#===================================================

Earnings after graduation (6 years on)

earnRepay <- student %>% 
  dplyr::select(unitID,College,UGDS,matches("^(RPY|CDR|.+earn_wne_p6)")) %T>% 
  print(n=20) %>%
  mutate(probSchool = UGDS/sum(UGDS),
         sdlog         = sqrt(log((sd_earn_wne_p6_2005/mn_earn_wne_p6_2005)^2+1)),
         meanlog       = log(mn_earn_wne_p6_2005^2/sqrt(mn_earn_wne_p6_2005^2+sd_earn_wne_p6_2005^2)),
         p_le30K       = plnorm(30.0E3,meanlog,sdlog),
         p_gt30Kle48K  = plnorm(48.0E3,meanlog,sdlog)  - p_le30K,
         p_gt48Kle75K  = plnorm(75.0E3,meanlog,sdlog)  - plnorm(48.0E3,meanlog,sdlog),
         p_gt75Kle110K = plnorm(110.0E3,meanlog,sdlog) - plnorm(75.0E3,meanlog,sdlog),
         p_gt110K      = plnorm(110.0E3,meanlog,sdlog,lower.tail=FALSE),
         totprob       = p_le30K + p_gt30Kle48K + p_gt48Kle75K + p_gt75Kle110K + p_gt110K) %>%
  dplyr::select(1:7,probSchool,starts_with('p_')) %T>%
  print(n=20)
## # A tibble: 1,438 x 14
##    unitID                             College  UGDS  CDR3 RPY_3YR_RT
##     <dbl>                              <fctr> <dbl> <dbl>      <dbl>
##  1 100654            Alabama A & M University  4051 0.163  0.4447139
##  2 100663 University of Alabama at Birmingham 11200 0.080  0.7562667
##  3 100706 University of Alabama in Huntsville  5525 0.077  0.7819979
##  4 100724            Alabama State University  5354 0.191  0.3311989
##  5 100751           The University of Alabama 28692 0.081  0.8139413
##  6 100812             Athens State University  2999 0.094  0.7676491
##  7 100830     Auburn University at Montgomery  4322 0.131  0.6288562
##  8 100858                   Auburn University 19761 0.061  0.8824554
##  9 100937         Birmingham Southern College  1181 0.099  0.8498498
## 10 101073           Concordia College Alabama   523 0.279  0.2541254
## 11 101189                 Faulkner University  2358 0.120  0.5630294
## 12 101435                  Huntingdon College  1100 0.105  0.7814313
## 13 101480       Jacksonville State University  7195 0.113  0.6276079
## 14 101541                      Judson College   331 0.117  0.7181208
## 15 101587          University of West Alabama  1988 0.096  0.5432300
## 16 101675                       Miles College  1663 0.221  0.2167889
## 17 101693                University of Mobile  1472 0.035  0.7558570
## 18 101709            University of Montevallo  2610 0.098  0.7711864
## 19 101879         University of North Alabama  5536 0.134  0.6879010
## 20 101912                  Oakwood University  1854 0.107  0.4392157
## # ... with 1,418 more rows, and 9 more variables: RPY_5YR_RT <dbl>,
## #   RPY_7YR_RT <dbl>, mn_earn_wne_p6_2005 <dbl>,
## #   md_earn_wne_p6_2005 <dbl>, pct10_earn_wne_p6_2005 <dbl>,
## #   pct25_earn_wne_p6_2005 <dbl>, pct75_earn_wne_p6_2005 <dbl>,
## #   pct90_earn_wne_p6_2005 <dbl>, sd_earn_wne_p6_2005 <dbl>
## # A tibble: 1,438 x 13
##    unitID                             College  UGDS  CDR3 RPY_3YR_RT
##     <dbl>                              <fctr> <dbl> <dbl>      <dbl>
##  1 100654            Alabama A & M University  4051 0.163  0.4447139
##  2 100663 University of Alabama at Birmingham 11200 0.080  0.7562667
##  3 100706 University of Alabama in Huntsville  5525 0.077  0.7819979
##  4 100724            Alabama State University  5354 0.191  0.3311989
##  5 100751           The University of Alabama 28692 0.081  0.8139413
##  6 100812             Athens State University  2999 0.094  0.7676491
##  7 100830     Auburn University at Montgomery  4322 0.131  0.6288562
##  8 100858                   Auburn University 19761 0.061  0.8824554
##  9 100937         Birmingham Southern College  1181 0.099  0.8498498
## 10 101073           Concordia College Alabama   523 0.279  0.2541254
## 11 101189                 Faulkner University  2358 0.120  0.5630294
## 12 101435                  Huntingdon College  1100 0.105  0.7814313
## 13 101480       Jacksonville State University  7195 0.113  0.6276079
## 14 101541                      Judson College   331 0.117  0.7181208
## 15 101587          University of West Alabama  1988 0.096  0.5432300
## 16 101675                       Miles College  1663 0.221  0.2167889
## 17 101693                University of Mobile  1472 0.035  0.7558570
## 18 101709            University of Montevallo  2610 0.098  0.7711864
## 19 101879         University of North Alabama  5536 0.134  0.6879010
## 20 101912                  Oakwood University  1854 0.107  0.4392157
## # ... with 1,418 more rows, and 8 more variables: RPY_5YR_RT <dbl>,
## #   RPY_7YR_RT <dbl>, probSchool <dbl>, p_le30K <dbl>, p_gt30Kle48K <dbl>,
## #   p_gt48Kle75K <dbl>, p_gt75Kle110K <dbl>, p_gt110K <dbl>
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*earnRepay$probSchool,na.rm=TRUE))} # Approximate
earnRepayBF <- earnRepay %>% 
  dplyr::select(-c(1:3,5:8)) %>% 
  mutate_each(funs(makeBF)) %>% 
  setNames(paste0("BF_",names(.))) %$%
  bind_cols(earnRepay[c(1:3,8)],.) %>%
  filter(complete.cases(.))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
earnRepayBF %>% 
  dplyr::select(-3) %>% 
  inner_join(student %>% dplyr::select(unitID,ADM_RATE),by='unitID') %>% 
  filter(!is.na(ADM_RATE)) %>% 
  mutate(ADM_RATEd=discretize(.['ADM_RATE'],method='interval',breaks=6)[['ADM_RATE']]) %>% 
  arrange(desc(BF_p_gt110K)) %>% 
  filter(BF_p_le30K<0.5 & BF_p_gt110K>0.5) %$% 
  qplot(x=BF_p_le30K,y=BF_p_gt110K,color=ADM_RATEd) + geom_text(aes(label=College))

#===================================================

College Setting: locale & region.

student %$% { locale %>% summary %>% print }
##                                                                                                            City: Large (population of 250,000 or more) 
##                                                                                                                                                    308 
##                                                                                   City: Midsize (population of at least 100,000 but less than 250,000) 
##                                                                                                                                                    184 
##                                                                                                             City: Small (population less than 100,000) 
##                                                                                                                                                    214 
## Rural: Distant (rural territory more than 5 miles but up to 25 miles from an urbanized area or more than 2.5 and up to 10 miles from an urban cluster) 
##                                                                                                                                                     24 
##                                          Rural: Fringe (rural territory up to 5 miles from an urbanized area or up to 2.5 miles from an urban cluster) 
##                                                                                                                                                     45 
##                                 Rural: Remote (rural territory more than 25 miles from an urbanized area and more than 10 miles from an urban cluster) 
##                                                                                                                                                     12 
##                                                           Suburb: Large (outside principal city, in urbanized area with population of 250,000 or more) 
##                                                                                                                                                    255 
##                                  Suburb: Midsize (outside principal city, in urbanized area with population of at least 100,000 but less than 250,000) 
##                                                                                                                                                     43 
##                                                            Suburb: Small (outside principal city, in urbanized area with population less than 100,000) 
##                                                                                                                                                     30 
##                                                          Town: Distant (in urban cluster more than 10 miles and up to 35 miles from an urbanized area) 
##                                                                                                                                                    150 
##                                                                                  Town: Fringe (in urban cluster up to 10 miles from an urbanized area) 
##                                                                                                                                                     59 
##                                                                              Town: Remote (in urban cluster more than 35 miles from an urbanized area) 
##                                                                                                                                                    114
student %$% { region %>% summary %>% print }
##                          Far West (AK, CA, HI, NV, OR, WA) 
##                                                        143 
##                           Great Lakes (IL, IN, MI, OH, WI) 
##                                                        228 
##                          Mid East (DE, DC, MD, NJ, NY, PA) 
##                                                        265 
##                       New England (CT, ME, MA, NH, RI, VT) 
##                                                        137 
##                        Plains (IA, KS, MN, MO, NE, ND, SD) 
##                                                        160 
##                       Rocky Mountains (CO, ID, MT, UT, WY) 
##                                                         41 
## Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC, TN, VA, WV) 
##                                                        354 
##                                 Southwest (AZ, NM, OK, TX) 
##                                                        110
localeNames <- sort(unique(as.character(student$locale)))
localeAggregates <- setNames(c(gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[1:3]),
                               rep('Rural',3),
                               gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[7]),
                               rep('Suburb:Small/Midsize & Town:Fringe',2),
                               gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[10]),
                               'Suburb:Small/Midsize & Town:Fringe',
                               gsub('([^ ]+) ([^ ]+).+','\\1\\2',localeNames[12])),localeNames)
student %<>% mutate(localeAgg = factor(localeAggregates[as.character(locale)],
                                       levels=c('Rural','Town:Remote','Town:Distant','Suburb:Small/Midsize & Town:Fringe',
                                                'Suburb:Large','City:Small','City:Midsize','City:Large')))
makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*settingBF$probSchool,na.rm=TRUE))} # Approximate
settingBF <- student %>% 
  dplyr::select(unitID,College,UGDS,localeAgg,region) %>%
  mutate(probSchool = UGDS/sum(UGDS))
a <- settingBF %$% model.matrix(~localeAgg - 1,data=.)
colnames(a) <- gsub('^([^:]+):*(.+)$','BF_\\1\\2',colnames(a))
a %<>% as.data.frame %>% tbl_df %>% mutate_each(funs(makeBF))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
settingBF %<>% dplyr::select(-localeAgg) %>% bind_cols(a)

a <- settingBF %>%
  mutate(regionName = factor(gsub(' ','',as.character(student$region)))) %$% 
  model.matrix(~regionName - 1,data=.)
colnames(a) <- gsub('^regionName','BF_',colnames(a))
a %<>% as.data.frame %>% tbl_df %>% mutate_each(funs(makeBF))
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
settingBF %<>% dplyr::select(-region) %>% bind_cols(a)

Didn’t get around to adding in costs…

#===================================================
#costBF
#===================================================

Academics: Completion rates, admissions rates and SAT scores.

Took a look at scaled test scores, but ultimately settled upon the log-normal fits to SATs below.

# scaleSAT <- function(Score) (Score - 280)/(800 - 280)
# scaleACT <- function(Score) (Score -   9)/( 36 -   9)
# Impute missing values with the means
#lmtest <- lm(ACTCMMID ~ SAT_AVG,data=student)

# academicsBF <- student %>% 
#   filter(!(isSpecialty=='TRUE' & is.na(ADM_RATE))) %>%
#   dplyr::select(unitID,College,ADM_RATE,matches('(^C150)|(SAT|ACT)')) %>%
#   filter(!is.na(C150_4_POOLED_SUPP)) %>%
#   mutate(ADM_RATE = ifelse(is.na(ADM_RATE),mean(ADM_RATE,na.rm=TRUE),ADM_RATE)) %>%
#   mutate(SAT_AVG_scaled  = scaleSAT(ifelse(is.na(SAT_AVG) , mean(SAT_AVG,na.rm=TRUE), SAT_AVG))) %>%
#   mutate(ACTCMMID_scaled = scaleACT(ifelse(is.na(ACTCMMID),mean(ACTCMMID,na.rm=TRUE),ACTCMMID))) %>%
#   dplyr::select(1:3,ncol(.) - (2:0))

# a <- academicsBF %>% 
#   mutate_each(funs(scaleSAT),starts_with('SAT'),-ends_with('scaled')) %>% 
#   mutate_each(funs(scaleACT),starts_with('ACT'),-ends_with('scaled')) %>%
#   gather(key=Test,value=scaledScore,-unitID,-College,-ADM_RATE)

NOTE: This is a bit of a kludge, but it results in a distribution for SAT Verbal+Math scores for each school, from which Bayes factors for 200-pt intervals are computed. (Kludginess: Adding the quartiles of Verbal & Math DOES NOT equal the quartiles of Verbal+Math.)

student %>% 
  mutate(SAT_25=SATVR25+SATMT25,SAT_75=SATVR75+SATMT75) %>% 
  filter(!is.na(SAT_AVG) & !is.na(SAT_75)) %>% 
  ggplot(aes(x=SAT_AVG)) + geom_density(fill='orange',alpha=0.3) + 
  geom_density(aes(x=SAT_25),fill='red',alpha=0.1) + 
  geom_density(aes(x=SAT_75),fill='green',alpha=0.1)

academics <- student %>% 
  mutate(SAT_25=SATVR25+SATMT25,SAT_75=SATVR75+SATMT75) %>% 
  filter(!is.na(SAT_AVG) & !is.na(SAT_75)) %>% 
  dplyr::select(unitID,College,UGDS,ADM_RATE,C150_4_POOLED_SUPP,SAT_25,SAT_AVG,SAT_75)


# Clumsily fit log-normal distributions to the SAT score quartiles for each school.
fr <- function(x,SAT_AVG,SAT_25,SAT_75,probs=c(0.25,0.75)) {
  sdl <- x[1]
  q <- qlnorm(p=probs,meanlog=log(SAT_AVG) - sdl^2/2,sdlog=sdl)
  log(SAT_25/q[1])^2 + log(SAT_75/q[2])^2
}
getMuSd <- function(SAT_AVG,SAT_25,SAT_75) {
  probs  <- c(0.25,0.75)
  SAT_AVG0 <- SAT_AVG
  if(SAT_75==1600) {
    probs   <- probs/(0.75+1.0E-5)
    SAT_AVG <- (SAT_AVG0 - 1600*(1-(0.75+1.0E-5)))/(0.75+1.0E-5)
  }
  soln <- optim(c(0.05), fr,lower=1.0E-3,upper=0.2,method='L-BFGS-B',
                SAT_AVG=SAT_AVG,SAT_25=SAT_25,SAT_75=SAT_75,probs=probs)
  sdl  <- soln$par
  meanlog  <- log(SAT_AVG) - sdl^2/2
  pSAT_AVG <- exp(meanlog+0.5*sdl^2)
  q   <- qlnorm(p=probs,meanlog=meanlog,sdlog=sdl)
  if(SAT_75==1600){
    pSAT_AVG <- pSAT_AVG*(0.75+1.0E-5) + 1600*(1-(0.75+1.0E-5))
  }
  return(c(meanlog=meanlog,sdlog=sdl,pSAT_25=q[1],pSAT_AVG=pSAT_AVG,pSAT_75=q[2]))
}

Get the log-normal parameters of the SAT distribution for each school

academics <- academics %$% {bind_cols(.,as.data.frame(t(mapply(getMuSd,SAT_AVG,SAT_25,SAT_75))))}
academics %>% 
  filter(grepl('Harvard|Cal.+Inst.+Tech|Mass.+Inst.+Tech|Princeton U|Northwestern U|Cornell U',College)) %>% 
  arrange(desc(SAT_AVG))
## # A tibble: 6 x 13
##   unitID                               College  UGDS ADM_RATE
##    <dbl>                                <fctr> <dbl>    <dbl>
## 1 110404    California Institute of Technology   977   0.1055
## 2 166683 Massachusetts Institute of Technology  4510   0.0815
## 3 166027                    Harvard University  7278   0.0584
## 4 186131                  Princeton University  5234   0.0741
## 5 147767               Northwestern University  8855   0.1532
## 6 190415                    Cornell University 14309   0.1556
## # ... with 9 more variables: C150_4_POOLED_SUPP <dbl>, SAT_25 <dbl>,
## #   SAT_AVG <dbl>, SAT_75 <dbl>, meanlog <dbl>, sdlog <dbl>,
## #   pSAT_25 <dbl>, pSAT_AVG <dbl>, pSAT_75 <dbl>

Discretize the SAT distributions

academics %<>%
  filter(!is.na(C150_4_POOLED_SUPP)) %>%
  mutate(probSchool = UGDS/sum(UGDS),
         p_le800        = plnorm( 800,meanlog,sdlog),
         p_gt800le1000  = plnorm(1000,meanlog,sdlog) - p_le800,
         p_gt1000le1200 = plnorm(1200,meanlog,sdlog) - plnorm(1000,meanlog,sdlog),
         p_gt1200le1400 = plnorm(1400,meanlog,sdlog) - plnorm(1200,meanlog,sdlog),
         p_gt1400       = plnorm(1400,meanlog,sdlog,lower.tail=FALSE),
         totprob        = p_le800 + p_gt800le1000 + p_gt1000le1200 + p_gt1200le1400 + p_gt1400) %>%
  print
## # A tibble: 1,119 x 20
##    unitID                             College  UGDS ADM_RATE
##     <dbl>                              <fctr> <dbl>    <dbl>
##  1 100654            Alabama A & M University  4051   0.8989
##  2 100663 University of Alabama at Birmingham 11200   0.8673
##  3 100706 University of Alabama in Huntsville  5525   0.8062
##  4 100724            Alabama State University  5354   0.5125
##  5 100751           The University of Alabama 28692   0.5655
##  6 100858                   Auburn University 19761   0.8274
##  7 100937         Birmingham Southern College  1181   0.6422
##  8 101435                  Huntingdon College  1100   0.6279
##  9 101480       Jacksonville State University  7195   0.8326
## 10 101541                      Judson College   331   0.7388
## # ... with 1,109 more rows, and 16 more variables:
## #   C150_4_POOLED_SUPP <dbl>, SAT_25 <dbl>, SAT_AVG <dbl>, SAT_75 <dbl>,
## #   meanlog <dbl>, sdlog <dbl>, pSAT_25 <dbl>, pSAT_AVG <dbl>,
## #   pSAT_75 <dbl>, probSchool <dbl>, p_le800 <dbl>, p_gt800le1000 <dbl>,
## #   p_gt1000le1200 <dbl>, p_gt1200le1400 <dbl>, p_gt1400 <dbl>,
## #   totprob <dbl>

Compute Bayes Factors for the discretized SAT score distributions of each school

makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*academics$probSchool,na.rm=TRUE))} # Approximate
academicsBF <- academics %>% 
  dplyr::select(-c(1:3,matches('SAT|log|prob'))) %>% 
  mutate_each(funs(makeBF)) %>% 
  setNames(gsub("^p","BF_SAT",names(.))) %>%
  setNames(gsub("^([^B])","BF_\\1",names(.))) %$%
  bind_cols(academics[1:3],.)
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over all variables, use `mutate_all()`
academicsBF %>% 
  filter(grepl('Harvard|Cal.+Inst.+Tech|Mass.+Inst.+Tech|Princeton U|Northwestern U|Cornell U',College)) %>% 
  arrange(desc(BF_SAT_gt1400)) %>% 
  print(width=Inf)
## # A tibble: 6 x 10
##   unitID                               College  UGDS BF_ADM_RATE BF_C150_4_POOLED_SUPP BF_SAT_le800 BF_SAT_gt800le1000 BF_SAT_gt1000le1200 BF_SAT_gt1200le1400 BF_SAT_gt1400
##    <dbl>                                <fctr> <dbl>       <dbl>                 <dbl>        <dbl>              <dbl>               <dbl>               <dbl>         <dbl>
## 1 110404    California Institute of Technology   977  -0.7718180             0.1903925    -7.763154          -8.424919          -8.5624697          -7.2934024     1.0517077
## 2 166027                    Harvard University  7278  -1.0286576             0.2119162    -7.763154          -8.424919          -8.5624697          -1.2155538     1.0457960
## 3 186131                  Princeton University  5234  -0.9252523             0.2086179    -7.763154          -8.424919          -8.5624697          -0.8427212     1.0376274
## 4 166683 Massachusetts Institute of Technology  4510  -0.8839129             0.1924128    -7.763154          -7.870078          -2.7437355          -0.1406519     0.9753106
## 5 147767               Northwestern University  8855  -0.6098117             0.1952930    -7.763144          -5.374484          -1.6772526           0.1449584     0.8856398
## 6 190415                    Cornell University 14309  -0.6030609             0.1942822    -6.384011          -2.780483          -0.8064331           0.2595414     0.7831573

Now add in the disciplines:

makeBF <- function(x) {x <- 1.0E-9 + x;log10(x/sum(x*academics$probSchool,na.rm=TRUE))} # Approximate

discplnsBF <-  academics %>% 
  dplyr::select( unitID, probSchool ) %>% 
  inner_join( disciplines2013, by = 'unitID' ) %>%
  mutate_each( funs( makeBF ), -seq_len(which(names(.) == discNames[1])-1) ) %>% 
  setNames( c( setdiff( names(.), discNames ) , paste0( 'BF_', discNames ) ) )
## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over a selection of variables, use `mutate_at()`
academicsBF %<>% inner_join(discplnsBF %>% dplyr::select(unitID,starts_with('BF_')),by='unitID')

Bringing the Data Together

At this point, we’ve got enough pieces of the puzzle to construct a working model that demonstrates the concept of a decision analysis tool that takes as input a student’s demographics and behavioral traits and the dataset of college Bayes factors and outputs a list of the top-N colleges maximizing the utility for that specific student.

First join the separate data tables of Bayes factors into a single table.

# NOW: Build the model covariate data frame with all possible covariate
# candidates. Each student's model only uses a subset of these columns.  See
# function 'utility'....
studentBF <- student     %>% dplyr::select(unitID,College,UGDS,ADM_RATE) %>%
  inner_join(incomeBF    %>% dplyr::select(-College,-UGDS), by='unitID') %>%
  inner_join(ethnicBF    %>% dplyr::select(-College),       by='unitID') %>%
  inner_join(aidBF       %>% dplyr::select(-College,-UGDS), by='unitID') %>%
  inner_join(settingBF   %>% dplyr::select(-College,-UGDS,-probSchool), by='unitID') %>%
  inner_join(earnRepayBF %>% dplyr::select(-College,-UGDS,-probSchool), by='unitID') %>%
  #  inner_join(costBF      %>% dplyr::select(-College,-UGDS), by='unitID') %>%
  inner_join(academicsBF %>% dplyr::select(-College,-UGDS), by='unitID') %>%
  mutate(BF_prior = log10(nrow(.)*probSchool)) %>%
  print(n=20)
## # A tibble: 1,119 x 97
##    unitID                             College  UGDS ADM_RATE   probSchool
##     <dbl>                              <fctr> <dbl>    <dbl>        <dbl>
##  1 100654            Alabama A & M University  4051   0.8989 5.165870e-04
##  2 100663 University of Alabama at Birmingham 11200   0.8673 1.428234e-03
##  3 100706 University of Alabama in Huntsville  5525   0.8062 7.045528e-04
##  4 100724            Alabama State University  5354   0.5125 6.827467e-04
##  5 100751           The University of Alabama 28692   0.5655 3.658829e-03
##  6 100858                   Auburn University 19761   0.8274 2.519940e-03
##  7 100937         Birmingham Southern College  1181   0.6422 1.506021e-04
##  8 101435                  Huntingdon College  1100   0.6279 1.402730e-04
##  9 101480       Jacksonville State University  7195   0.8326 9.175126e-04
## 10 101541                      Judson College   331   0.7388 4.220941e-05
## 11 101693                University of Mobile  1472   0.7125 1.877107e-04
## 12 101709            University of Montevallo  2610   0.8729 3.328295e-04
## 13 101879         University of North Alabama  5536   0.8092 7.059555e-04
## 14 101912                  Oakwood University  1854   0.3435 2.364237e-04
## 15 102049                  Samford University  2995   0.7697 3.819250e-04
## 16 102094         University of South Alabama 11048   0.8604 1.408851e-03
## 17 102234                 Spring Hill College  1269   0.4627 1.618240e-04
## 18 102270                    Stillman College   863   0.4300 1.100505e-04
## 19 102377                 Tuskegee University  2584   0.3511 3.295139e-04
## 20 102669           Alaska Pacific University   340   0.3745 4.335709e-05
## # ... with 1,099 more rows, and 92 more variables:
## #   BF_DEP_STAT_PCT_INDx <dbl>, BF_DEP_STAT_PCT_DEPx <dbl>,
## #   BF_DEP_INC_PCT_LOx <dbl>, BF_DEP_INC_PCT_HI2x <dbl>,
## #   BF_DEP_INC_GTLO_LTH2x <dbl>, BF_IND_INC_PCT_LOx <dbl>,
## #   BF_IND_INC_PCT_HI2x <dbl>, BF_IND_INC_GTLO_LTH2x <dbl>,
## #   BF_WHITE <dbl>, BF_BLACK <dbl>, BF_HISP <dbl>, BF_ASIAN <dbl>,
## #   BF_AIAN <dbl>, BF_NHPI <dbl>, BF_2MOR <dbl>, BF_NRA <dbl>,
## #   BF_UNKN <dbl>, BF_female_2005 <dbl>, BF_pell_ever_2005 <dbl>,
## #   BF_fsend_1_2005 <dbl>, BF_fsend_2_2005 <dbl>, BF_fsend_3_2005 <dbl>,
## #   BF_fsend_4_2005 <dbl>, BF_fsend_5_2005 <dbl>, BF_localeAggRural <dbl>,
## #   BF_localeAggTownRemote <dbl>, BF_localeAggTownDistant <dbl>,
## #   `BF_localeAggSuburbSmall/Midsize & Town:Fringe` <dbl>,
## #   BF_localeAggSuburbLarge <dbl>, BF_localeAggCitySmall <dbl>,
## #   BF_localeAggCityMidsize <dbl>, BF_localeAggCityLarge <dbl>,
## #   `BF_FarWest(AK,CA,HI,NV,OR,WA)` <dbl>,
## #   `BF_GreatLakes(IL,IN,MI,OH,WI)` <dbl>,
## #   `BF_MidEast(DE,DC,MD,NJ,NY,PA)` <dbl>,
## #   `BF_NewEngland(CT,ME,MA,NH,RI,VT)` <dbl>,
## #   `BF_Plains(IA,KS,MN,MO,NE,ND,SD)` <dbl>,
## #   `BF_RockyMountains(CO,ID,MT,UT,WY)` <dbl>,
## #   `BF_Southeast(AL,AR,FL,GA,KY,LA,MS,NC,SC,TN,VA,WV)` <dbl>,
## #   `BF_Southwest(AZ,NM,OK,TX)` <dbl>, BF_CDR3 <dbl>, BF_p_le30K <dbl>,
## #   BF_p_gt30Kle48K <dbl>, BF_p_gt48Kle75K <dbl>, BF_p_gt75Kle110K <dbl>,
## #   BF_p_gt110K <dbl>, BF_ADM_RATE <dbl>, BF_C150_4_POOLED_SUPP <dbl>,
## #   BF_SAT_le800 <dbl>, BF_SAT_gt800le1000 <dbl>,
## #   BF_SAT_gt1000le1200 <dbl>, BF_SAT_gt1200le1400 <dbl>,
## #   BF_SAT_gt1400 <dbl>, BF_AgricultureAgriculture <dbl>,
## #   BF_NaturalResources <dbl>, BF_ArchitectureRelated <dbl>,
## #   BF_AreaEthnic <dbl>, BF_CommunicationJournalism <dbl>,
## #   BF_CommunicationsTechnologies <dbl>, BF_ComputerInformation <dbl>,
## #   BF_PersonalCulinary <dbl>, BF_Education <dbl>, BF_Engineering <dbl>,
## #   BF_EngineeringTechnologies <dbl>, BF_ForeignLanguages <dbl>,
## #   BF_FamilyConsumer <dbl>, BF_LegalProfessions <dbl>,
## #   BF_EnglishLanguage <dbl>, BF_LiberalArts <dbl>,
## #   BF_LibraryScience <dbl>, BF_BiologicalBiomedical <dbl>,
## #   BF_MathematicsStatistics <dbl>, BF_MilitaryTechnologies <dbl>,
## #   BF_MultiInterdisciplinary <dbl>, BF_ParksRecreation <dbl>,
## #   BF_PhilosophyReligious <dbl>, BF_TheologyReligious <dbl>,
## #   BF_PhysicalSciences <dbl>, BF_ScienceTechnologies <dbl>,
## #   BF_Psychology <dbl>, BF_HomelandSecurity <dbl>,
## #   BF_PublicAdministration <dbl>, BF_SocialSciences <dbl>,
## #   BF_ConstructionTrades <dbl>, BF_MechanicRepair <dbl>,
## #   BF_PrecisionProduction <dbl>, BF_TransportationMaterials <dbl>,
## #   BF_VisualPerforming <dbl>, BF_HealthProfessions <dbl>,
## #   BF_BusinessManagement <dbl>, BF_History <dbl>, BF_prior <dbl>

Mapping Student Traits to College Properties

Get the coefficients for the student profile: Two levels of coefficients: 1.Property Category level, 2. Subcategory level

Trait level has 4 types with impact on properties as follows when each trait increases from low to medium to high:

  1. Risk – willingness to surround self with people & settings very different from self & origins;
    • Weights homogeneity & sameness as self lower than other traits.
  2. Vision – willingness to look past short-term suitability in pursuit of future;
    • Weights completion rates, future earnings, debt & debt repayment rates higher than current campus setting compatability.
  3. Breadth – willingness to entertain a variety of academic disciplines;
    • Weights predominance of own major discipline lower than otherwise.
  4. Challenge – willingness to embrace highest academic rigor
    • Weights admissions selectiveness, high SAT & higher than otherwise.

Could also add in more or less uncertainty to beta’s according to traits as well…

Traits Have Impact On Four Property Categories Listed in Rows:

Risk Vision Breadth Challenge Properties
1. Self & Origins same to differ —— ——- —- ethn,inc,1stgen,usbrn,pov
2. Campus Setting & Folks homog to diverse —— ——- —- reg,loc,siz
3. Academics —- —— narrow to broad low to high sat/adm sat,admrt,disc
4. After-College Prospects —- low to high dbt/ern ——- —- comprt,earn,rpy,dfltrt
# This maps the student profile property names to the college Bayes factor data table ('studentBF') column names.

propertyMap <- list(
  ethnicity  = c(white='BF_WHITE',black='BF_BLACK',hispanic='BF_HISP',asian='BF_ASIAN'),
  income     = list(dependent   = c(lt30K="BF_DEP_INC_PCT_LOx",gt30Kle110K="BF_DEP_INC_GTLO_LTH2x",gt110K="BF_DEP_INC_PCT_HI2x"),
                    independent = c(lt30K="BF_IND_INC_PCT_LOx",gt30Kle110K="BF_IND_INC_GTLO_LTH2x",gt110K="BF_IND_INC_PCT_HI2x")),
  sat        = c(le800='BF_SAT_le800',  gt800le1000='BF_SAT_gt800le1000', gt1000le1200='BF_SAT_gt1000le1200', 
                 gt1200le1400='BF_SAT_gt1200le1400', gt1400='BF_SAT_gt1400'),
  discipline = setNames(paste0('BF_',discNames),discNames),
  fasfa      = c(fsend_1='BF_fsend_1_2005',fsend_2='BF_fsend_2_2005',fsend_3='BF_fsend_3_2005',fsend_4='BF_fsend_4_2005',fsend_5='BF_fsend_5_2005'),
  region     = c(FarWest='BF_FarWest(AK,CA,HI,NV,OR,WA)',GreatLakes='BF_GreatLakes(IL,IN,MI,OH,WI)',MidEast='BF_MidEast(DE,DC,MD,NJ,NY,PA)',
                 NewEngland='BF_NewEngland(CT,ME,MA,NH,RI,VT)',Plains='BF_Plains(IA,KS,MN,MO,NE,ND,SD)',RockyMountains='BF_RockyMountains(CO,ID,MT,UT,WY)',
                 Southeast='BF_Southeast(AL,AR,FL,GA,KY,LA,MS,NC,SC,TN,VA,WV)',Southwest='BF_Southwest(AZ,NM,OK,TX)'),
  locale     = c(Rural='BF_localeAggRural',TownRemote='BF_localeAggTownRemote',TownDistant='BF_localeAggTownDistant',
                 SuburbSmallMid='BF_localeAggSuburbSmall/Midsize & Town:Fringe',SuburbLarge='BF_localeAggSuburbLarge',
                 CitySmall='BF_localeAggCitySmall',CityMidsize='BF_localeAggCityMidsize',CityLarge='BF_localeAggCityLarge'),
  earnings   = c(le30K='BF_p_le30K',  gt30Kle48K='BF_p_gt30Kle48K', gt48Kle75K='BF_p_gt48Kle75K', gt75Kle110K='BF_p_gt75Kle110K', gt110K='BF_p_gt110K' ),
  prior      = 'BF_prior',
  completion = 'BF_C150_4_POOLED_SUPP',
  admission  = 'BF_ADM_RATE',
  gender     = c(female='BF_female_2005',male='BF_female_2005')
)

Model Parameters – The Partworths of the Utility Function

A critical determinant of how well the tool works is the capturing of specific student value assessments of college properties. This is done using parameters beta in the utility function. The function below does this.

This is just to show one of the many ways in which this can be done.

getParameters <- function(studentProfile,propertyMap){
  # Get the coefficients for the student profile:
  propCategory <- c('SelfOrigin','Setting','Academics','Prospects')
  traitLvls    <- setNames(as.list(studentProfile$traits),c('Risk','Vision','Breadth','Challenge'))
  propsByTrait <- list(Risk      = list(SelfOrigin = c('ethnicity','income'),
                                        Setting    = c('region','locale')), 
                       Vision    = list(Prospects  = c('completion','earnings'),
                                        Setting    = c('region','locale')), 
                       Breadth   = list(Academics  = c('discipline')), 
                       Challenge = list(Academics  = c('sat','admission','discipline')))
  
  # beta <- matrix(rep(1,length(xBF)),ncol=1,dimnames=list(c('ethnicity','income','sat','fasfa'),'Utility'))
  beta <- matrix(rep(1,length(propertyMap)),ncol=1)
  rownames(beta) <- names(propertyMap)
  beta['admission',1] <- -1.0  # prefer schools that have lower admissions rates
  beta['locale'   ,1] <-  0.5  # downweight locale
  beta['fasfa'    ,1] <-  0.3  # downweight fasfa
  # assumes stronger influence of gender for women than men...
  beta['gender'   ,1] <- ifelse(studentProfile$gender == 'female',0.5,-0.1) 
  # (^_^) ...there's gotta be a better way of doing this...
  for(pCat in propCategory){
    prpnm <- propsByTrait$Risk[[pCat]]
    if (!is.null(prpnm)){
      # must have vectors for each trait level with as many elements as are in the vectors of the propsByTrait list
      beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Risk,
                                              L=list(SelfOrigin=c(ethnicity=3.0,income=2.0),
                                                     Setting=c(region=3.0,locale=3.0))[[pCat]],
                                              M=list(SelfOrigin=c(ethnicity=0.3,income=0.3),
                                                     Setting=c(region=1.0,locale=1.0))[[pCat]],
                                              H=list(SelfOrigin=c(ethnicity=0.1,income=0.1),
                                                     Setting=c(region=0.1,locale=0.1))[[pCat]])
    }
    prpnm <- propsByTrait$Vision[[pCat]]
    if (!is.null(prpnm)){
      beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Vision,
                                              L=list(Prospects=c(completion=0.1,earnings=0.1),
                                                     Setting  = c(region=1.5,locale=1.5))[[pCat]],
                                              M=list(Prospects=c(completion=1.0,earnings=0.5),
                                                     Setting  = c(region=1.0,locale=1.0))[[pCat]],
                                              H=list(Prospects=c(completion=3.0,earnings=3.0),
                                                     Setting  = c(region=0.5,locale=0.5))[[pCat]])
    }
    prpnm <- propsByTrait$Breadth[[pCat]]
    if (!is.null(prpnm)){
      beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Breadth,
                                              L=list(Academics=c(discipline=3.0))[[pCat]],
                                              M=list(Academics=c(discipline=1.0))[[pCat]],
                                              H=list(Academics=c(discipline=0.3))[[pCat]])
    }
    prpnm <- propsByTrait$Challenge[[pCat]]
    if (!is.null(prpnm)){
      beta[prpnm,1] <- beta[prpnm,1] * switch(traitLvls$Challenge,
                                              L=list(Academics=c(sat=0.1,admission=0.1,discipline=0.3))[[pCat]],
                                              M=list(Academics=c(sat=1.0,admission=1.0,discipline=1.0))[[pCat]],
                                              H=list(Academics=c(sat=3.0,admission=3.0,discipline=3.0))[[pCat]])
    }
  }
  
  # Also, make the beta's of traits sum to 1 at each level so that final utility
  # is roughly same magnitude as a typical Bayes factor for school properties.
  # This facilitates interpretation of utility in similar fashion as a raw Bayes factor.
  beta[,1] <- beta[,1]/sum(beta[,1])
  
  return(beta)
}

Utility Function to be Maximized for the Student

Now the utility function of the discrete choice model:

#===================================================================
# Here's the Decision Analysis Model: Probabilistic, Utility-Based Discrete Choice Model
utility <- function(studentBF,studentProfile,propertyMap,dump.covariates=FALSE){
  # studentBF must come in as a 1-row matrix!!!
  
  covarnames <- sapply(names(propertyMap),function(propnm){
    if(propnm == 'income'){
      return(propertyMap$income[[ifelse(studentProfile$dependent,
                                        'dependent',
                                        'independent')]][studentProfile[[propnm]]])
    }
    levelName <- studentProfile[[propnm]]
    if(is.null(levelName)) return(propertyMap[[propnm]]) 
    return(propertyMap[[propnm]][levelName])
  })
  if(is.null(dim(studentBF))) {
    xBF <- matrix(studentBF[covarnames],nrow=1,dimnames=list(NULL,covarnames))
  } else {
    xBF <- matrix(studentBF[,covarnames],nrow=1,dimnames=list(NULL,covarnames))
  }
  
  # Assign useful column names...
  colnames(xBF) <- paste(colnames(xBF),names(covarnames),sep=':')
  
  #show(xBF)
  
  # Compute the utility the student derives from the school.
  beta <- setNames(studentProfile$beta,colnames(xBF))
  u    <- xBF %*% beta
  
  if (dump.covariates){
    wxBF <- xBF * matrix(beta,nrow=nrow(xBF),ncol=ncol(xBF),byrow = TRUE)
    colnames(wxBF) <- colnames(xBF)
    return(wxBF)
  } else {
    return(u[1])
  }
}

Archetypical Student Profiles

We now specify student profiles and test out the model to judge how reasonable are its rankings of colleges per specific student criteria.

#=====================================================
# Define (3 levels)^(4 traits) = 3^4 = 81 student trait archetypes:
# Trait level has 4 categories with impact on subtraits as follows when each
# trait increases from low to medium to high:
# 1. Risk      -- willingness to surround self with people & settings very different from self & origins;
#                 i. Weights homogeneity & sameness as self lower than other traits.
# 2. Vision    -- willingness to look past short-term suitability in pursuit of future;
#                 i. Weights completion rates, future earnings, debt & debt repayment rates higher 
#                    than current campus setting compatability.
# 3. Breadth   -- willingness to entertain a variety of academic disciplines;
#                 i. Weights predominance of own major discipline lower than otherwise. 
# 4. Challenge -- willingness to embrace highest academic rigor
#                 i. Weights admissions selectiveness, high SAT & higher than otherwise.
makeTraitLabels <- function(trait,ntrait,levels){
  # Note that this must be called with positive integers for 'trait' & 'ntrait' and ntrait <= length(levels) 
  # (with intention to use ntrait==length(levels)) where 'levels' is a list of character vectors.
  # Will return a character vector with length = prod(sapply(levels,length)).
  if(trait>=ntrait) return(levels[[trait]])
  return(c(sapply(levels[[trait]],paste0,makeTraitLabels(trait+1,ntrait,levels))))
}
traitLevels  <- list(Risk=c('Lr','Mr','Hr'),Vision=c('Lv','Mv','Hv'),Breadth=c('Lb','Mb','Hb'),Challenge=c('Lc','Mc','Hc'))
traitLabels  <- makeTraitLabels(trait=1,ntrait=length(traitLevels),levels=traitLevels)
StudentTrait <- setNames(strsplit(traitLabels,'[a-z]'),traitLabels)
#=====================================================

Modal States of Each Student Property:

a <- studentBF %>% dplyr::select(probSchool,starts_with('BF')) %>% as.matrix
meanBF <- t(a[,1] %*% a[,-1])
# *** PROBLEMATIC FOR MAINTENANCE & REVISION: MUST RESET THESE IF ANY COLUMNS CHANGE IN studentBF ***
inm <- setNames(c(2,6,9,1,1,5,8,8,1,5,1,1,5,38),
                c('dependent','income','ethnicity','gender','pellgrant','fasfa',
                  'locale','region','defaultrate','earnings','admissions',
                  'completion','sat','discipline')) 
propNames <- mapply(function(i,si) rownames(meanBF)[seq(si-i+1,length.out=i)],inm,cumsum(inm))
domProps  <- sapply(propNames,function(nms) names(meanBF[nms,])[which.max(meanBF[nms,])])
show(domProps)
##                                           dependent 
##                              "BF_DEP_STAT_PCT_DEPx" 
##                                              income 
##                             "BF_DEP_INC_GTLO_LTH2x" 
##                                           ethnicity 
##                                          "BF_WHITE" 
##                                              gender 
##                                    "BF_female_2005" 
##                                           pellgrant 
##                                 "BF_pell_ever_2005" 
##                                               fasfa 
##                                   "BF_fsend_3_2005" 
##                                              locale 
##                             "BF_localeAggCityLarge" 
##                                              region 
## "BF_Southeast(AL,AR,FL,GA,KY,LA,MS,NC,SC,TN,VA,WV)" 
##                                         defaultrate 
##                                           "BF_CDR3" 
##                                            earnings 
##                                   "BF_p_gt30Kle48K" 
##                                          admissions 
##                                       "BF_ADM_RATE" 
##                                          completion 
##                             "BF_C150_4_POOLED_SUPP" 
##                                                 sat 
##                               "BF_SAT_gt1000le1200" 
##                                          discipline 
##                                     "BF_Psychology"

Create a few specific student profiles

(Check function ‘utility’ for valid entries for the student profile properties: valid entries appear as the names() of vectors ‘ethlist’, ‘inclist’, etc.)

students <- list(
  Black = list(
    dependent = TRUE,
    ethnicity = 'black',
    gender    = 'female',
    age       = 'le24',
    income    = 'gt30Kle110K',
    earnings  = 'gt110K',
    sat       = 'gt1000le1200',
    fasfa     = 'fsend_4',
    discipline= 'SocialSciences',
    region    = 'MidEast',
    locale    = 'CityLarge',
    traits    = StudentTrait$MrMvMbMc
  ),
  White = list(
    dependent = TRUE,
    ethnicity = 'white',
    gender    = 'female',
    age       = 'le24',
    income    = 'gt30Kle110K',
    earnings  = 'gt110K',
    sat       = 'gt1000le1200',
    fasfa     = 'fsend_4',
    discipline= 'SocialSciences',
    region    = 'MidEast',
    locale    = 'CityLarge',
    traits    = StudentTrait$MrMvMbMc
  ),
  Hispanic = list(
    dependent = TRUE,
    ethnicity = 'hispanic',
    gender    = 'female',
    age       = 'le24',
    income    = 'gt30Kle110K',
    earnings  = 'gt110K',
    sat       = 'gt1000le1200',
    fasfa     = 'fsend_4',
    discipline= 'SocialSciences',
    region    = 'MidEast',
    locale    = 'CityLarge',
    traits    = StudentTrait$MrMvMbMc
  ),
  Asian = list(
    dependent = TRUE,
    ethnicity = 'asian',
    gender    = 'female',
    age       = 'le24',
    income    = 'gt30Kle110K',
    earnings  = 'gt110K',
    sat       = 'gt1000le1200',
    fasfa     = 'fsend_4',
    discipline= 'SocialSciences',
    region    = 'MidEast',
    locale    = 'CityLarge',
    traits    = StudentTrait$MrMvMbMc
  ))

Visualizing Model Parameters

Let’s look at plots of the model parameters for a few specific student types:

# Test getParameters:
plotBeta <- function(beta,stdtProf){
  trtstr <- paste(paste(c('Risk','Vision','Breadth','Challenge'),stdtProf$traits,sep=":"),collapse=', ')
  b <- data_frame(Trait=factor(rownames(beta),levels=rownames(beta)[order(beta[,1],decreasing=TRUE)]),Value=beta[,1])
  gplt <- b %>% 
    ggplot(aes(x=Trait,y=Value,fill=Trait)) 
  (gplt + 
    geom_bar(stat='identity',position='dodge') + 
    #theme(text=element_text(face='bold')) + 
    scale_y_continuous(lim=c(-1,1)/2) +
    ggtitle(sprintf("Utility Partworths for %s",trtstr))) %>% print
}

Note how the parameters change with the behavior traits of Risk, Vision, Breadth & Challenge.

stdtProf <- students$White
beta1 <- getParameters(stdtProf,propertyMap)
plotBeta(beta1,stdtProf)

stdtProf <- students$White
stdtProf$traits <- StudentTrait$HrLvLbHc
beta2 <- getParameters(stdtProf,propertyMap)
plotBeta(beta2,stdtProf)
## Warning: Removed 1 rows containing missing values (geom_bar).

stdtProf <- students$White
stdtProf$traits <- StudentTrait$LrHvHbLc
beta3 <- getParameters(stdtProf,propertyMap)
plotBeta(beta3,stdtProf)

Generating Personalized College Rankings

What follows is function studentCaseStudy the ‘work horse’ function to provide the student with personalized college rankings:

# Define a function to assess the suitability of colleges for a specific student's profile:
studentCaseStudy <- function(studentBF,stdtProf,propertyMap,ntop=20,signifLevel=0.15,verbose=FALSE,exact=FALSE){
  utilities <- studentBF %>% 
    dplyr::select(starts_with('BF')) %>%
    apply(1,utility,studentProfile=stdtProf,propertyMap=propertyMap) %>%
    setNames(studentBF$College)
  
  tmp <- studentBF %>% 
    dplyr::select(starts_with('BF')) %>%
    slice(1) %>% as.matrix %>%
    utility(studentProfile=stdtProf,propertyMap=propertyMap,dump.covariates=TRUE)
  
  wxBF <- studentBF %>% 
    dplyr::select(starts_with('BF')) %>%
    apply(1,utility,studentProfile=stdtProf,propertyMap=propertyMap,dump.covariates=TRUE) %>%
    t %>% as.data.frame %>% tbl_df %>%
    setNames(gsub('^[^:]+:','',colnames(tmp))) %>%
    mutate(unitID  = studentBF$unitID,
           College = studentBF$College,
           Utility = utilities) %>% 
    arrange(desc(Utility)) %>%
    dplyr::select(unitID,College,Utility,everything())
  
  labels <- sprintf("%d. %s",order(order(utilities,decreasing=TRUE)),names(utilities))
  
  stdtUtility <- data_frame(unitID  = studentBF$unitID, 
                            College = names(utilities),
                            Utility = utilities) %>%
    mutate(labels = factor(labels,levels=labels[order(Utility,decreasing=TRUE)]),
           expChoice = 10^Utility,
           ChoicePct = 100*expChoice/sum(expChoice)) %>%
    dplyr::select(-expChoice) %>%
    inner_join(wxBF %>% dplyr::select(-College,-Utility),by='unitID') %>% 
    arrange(desc(Utility)) #%T>% print
  
  topN <- stdtUtility %>% dplyr::select(-College) %>% top_n(ntop,Utility) 
  
  if(verbose) {
    topN %>% print
    traitStr <- paste(paste(c('Risk','Vision','Breadth','Challenge'),stdtProf$traits,sep=":"),collapse=', ')
    profText <- stdtProf[!grepl('beta|traits',names(stdtProf))]
    pltTitle <- paste(strwrap(paste(paste(names(profText),profText,sep=':'),collapse=', '),width = 80),collapse='\n')
    pltTitle <- paste0("Top ",ntop," Colleges for Profile:\n",pltTitle,'\n',traitStr)
    
    gplt0 <- topN %>% 
      mutate(labels = factor(labels,levels=rev(levels(labels))),
             labelY = pmin(0,min(Utility))) %>%
      ggplot(aes(x=labels,y=Utility,fill=labels))
    gplt <-  gplt0 +
      geom_bar(stat='identity',position='dodge') + coord_flip() + 
      geom_text(aes(x=labels,y=labelY,label=labels),size=6,hjust=0,vjust=0.5) +
      theme(text=element_text(face='bold',size=8),
            axis.text.y=element_blank(),
            title=element_text(size=12,face='bold'),
            legend.position = "none") +
      ggtitle(pltTitle) +
      labs(x=sprintf("Top %d Colleges",ntop)) 
    gplt %>% print
  }
  
  # Back-calculate the beta coefficients and see if can determine which
  # covariate contributed most strongly to the utility rankings.
  stdtBF <- stdtUtility %>% 
    inner_join(studentBF %>% dplyr::select(-College),by='unitID')
  axbf <- stdtBF %>% dplyr::select(one_of(gsub('^[^:]+:','',colnames(tmp)))) %>% as.matrix
  betacoeff <- solve(t(axbf) %*% axbf,t(axbf) %*% matrix(stdtUtility$Utility,ncol=1))
  
  if(verbose) show(betacoeff[betacoeff>1.0E-9,])
  
  stdtBF %<>%
    dplyr::select(c(4,3,5),one_of(names(betacoeff[abs(betacoeff)>1.0E-9,]))) 
  
  if(verbose) stdtBF %>% print(n=ntop,width=Inf)
  
  utlBFcor <- stdtBF %$% 
    lapply(.[-(1:3)],function(y) cor.test(Utility[1:ntop],y[1:ntop],method='spearman',exact=exact)) #%T>% print
  
  # This shows which covariates dictate the rankings by utility. If significant
  # negative correlation, then that covariate is important in the latter half of
  # the ranking (note that reverse ordering due to negative coefficients is
  # already accounted for).
  pvals <- sapply(utlBFcor,'[[','p.value')
  pvals <- pvals[!is.na(pvals)]
  
  signifCor <- utlBFcor[pvals <= pmax(signifLevel,min(pvals,na.rm=TRUE))] %$% {.[order(sapply(.,'[[','p.value'))]}
  
  if(verbose) signifCor %>% show
  
  results <- list(student=stdtProf,topN=topN,utilities=stdtUtility,BF=stdtBF,cor=utlBFcor,signifCor=signifCor,beta=betacoeff)
  invisible(results)
}

Student Case Studies

Ideally, we’d build a nice GUI/app around this model so that a student can enter his or her specific information and get a custom ranking of colleges. Here, we’ll just manually specify a student and assess the results:

# Assess a specific student's case:
stdtProf <- students$Black
#stdtProf$gender <- 'male'
stdtProf$beta <- getParameters(stdtProf,propertyMap)
caseResult    <- studentCaseStudy(studentBF,stdtProf,propertyMap,verbose=TRUE,ntop=20)
## # A tibble: 20 x 16
##    unitID   Utility                                          labels
##     <dbl>     <dbl>                                          <fctr>
##  1 131159 0.3505585                          1. American University
##  2 131469 0.3465406                 2. George Washington University
##  3 193900 0.3294539                          3. New York University
##  4 131496 0.3248710                        4. Georgetown University
##  5 190664 0.3199828                          5. CUNY Queens College
##  6 190594 0.3049034                          6. CUNY Hunter College
##  7 131520 0.2947132                            7. Howard University
##  8 215293 0.2683836   8. University of Pittsburgh-Pittsburgh Campus
##  9 216339 0.2673033                            9. Temple University
## 10 191241 0.2550179                          10. Fordham University
## 11 190637 0.2513853                         11. CUNY Lehman College
## 12 215062 0.2511576                  12. University of Pennsylvania
## 13 189097 0.2469422                             13. Barnard College
## 14 190512 0.2399130               14. CUNY Bernard M Baruch College
## 15 190567 0.2286066                           15. CUNY City College
## 16 190549 0.2281797                       16. CUNY Brooklyn College
## 17 195809 0.2274490               17. St John's University-New York
## 18 190150 0.2245557 18. Columbia University in the City of New York
## 19 190600 0.2183074   19. CUNY John Jay College of Criminal Justice
## 20 186399 0.2127885                   20. Rutgers University-Newark
## # ... with 13 more variables: ChoicePct <dbl>, ethnicity.black <dbl>,
## #   income.gt30Kle110K <dbl>, sat.gt1000le1200 <dbl>,
## #   discipline.SocialSciences <dbl>, fasfa.fsend_4 <dbl>,
## #   region.MidEast <dbl>, locale.CityLarge <dbl>, earnings.gt110K <dbl>,
## #   prior <dbl>, completion <dbl>, admission <dbl>, gender.female <dbl>
##           ethnicity.black        income.gt30Kle110K 
##                         1                         1 
##          sat.gt1000le1200 discipline.SocialSciences 
##                         1                         1 
##             fasfa.fsend_4            region.MidEast 
##                         1                         1 
##          locale.CityLarge           earnings.gt110K 
##                         1                         1 
##                     prior                completion 
##                         1                         1 
##                 admission             gender.female 
##                         1                         1 
## # A tibble: 1,119 x 15
##                                             labels   Utility ChoicePct ethnicity.black income.gt30Kle110K sat.gt1000le1200 discipline.SocialSciences fasfa.fsend_4 region.MidEast locale.CityLarge earnings.gt110K        prior   completion    admission gender.female
##                                             <fctr>     <dbl>     <dbl>           <dbl>              <dbl>            <dbl>                     <dbl>         <dbl>          <dbl>            <dbl>           <dbl>        <dbl>        <dbl>        <dbl>         <dbl>
##  1                          1. American University 0.3505585  1.740040    -0.011261582        0.010633611     -0.005972833               0.109420682  0.0119673501      0.1211888       0.04571754     0.024259747 -0.001754930  0.018397979  0.025061210  2.900901e-03
##  2                 2. George Washington University 0.3465406  1.724017    -0.010852213        0.008589604     -0.027228160               0.091504238  0.0054844109      0.1211888       0.04571754     0.024593490  0.024942931  0.020047113  0.043047656 -4.948562e-04
##  3                          3. New York University 0.3294539  1.657505    -0.017677039       -0.005417375     -0.057770376               0.041273125  0.0027660389      0.1211888       0.04571754     0.035236322  0.078158782  0.023282996  0.059259808  3.435231e-03
##  4                        4. Georgetown University 0.3248710  1.640106    -0.011325299        0.008161051     -0.115293974               0.095944416  0.0054844109      0.1211888       0.04571754     0.056551517  0.002407464  0.030022356  0.088341977 -2.329315e-03
##  5                          5. CUNY Queens College 0.3199828  1.621749    -0.008638860       -0.012695869      0.034202974               0.066853462 -0.0040837124      0.1211888       0.04571754    -0.005113733  0.048983914 -0.004744205  0.035411575  2.900901e-03
##  6                          6. CUNY Hunter College 0.3049034  1.566406    -0.002115287       -0.010535503      0.020664480               0.043782752  0.0027660389      0.1211888       0.04571754    -0.013899521  0.054723124 -0.014661894  0.050315889  6.956979e-03
##  7                            7. Howard University 0.2947132  1.530080     0.042963634        0.004367055      0.016306909               0.017057312  0.0119673501      0.1211888       0.04571754     0.012589169 -0.001497064  0.002587864  0.015972434  5.492180e-03
##  8   8. University of Pittsburgh-Pittsburgh Campus 0.2683836  1.440073    -0.014495363        0.009664556     -0.007969455               0.008909767 -0.0040837124      0.1211888       0.04571754     0.019818547  0.065149585  0.019644596  0.007168040 -2.329315e-03
##  9                            9. Temple University 0.2673033  1.436495     0.003963264       -0.002928684      0.017138871              -0.017771861  0.0078821866      0.1211888       0.04571754    -0.005530387  0.092799642  0.006936569 -0.001597878 -4.948562e-04
## 10                          10. Fordham University 0.2550179  1.396429    -0.019547693       -0.008971455     -0.005972833               0.053166857  0.0078821866      0.1211888       0.04571754     0.007430305  0.010602233  0.020533599  0.019027040  3.961276e-03
## 11                         11. CUNY Lehman College 0.2513853  1.384797     0.018762441       -0.023215424      0.017636355               0.038486754 -0.0003720923      0.1211888       0.04571754    -0.024735980  0.017127792 -0.034460746  0.066888675  8.361150e-03
## 12                  12. University of Pennsylvania 0.2511576  1.384071    -0.008555141        0.006231782     -0.185599333               0.045106055 -0.0040837124      0.1211888       0.04571754     0.061474646  0.028119212  0.032139192  0.110513918 -1.095389e-03
## 13                             13. Barnard College 0.2469422  1.370702    -0.014909684        0.015408227     -0.053061621               0.078162061 -0.0003720923      0.1211888       0.04571754     0.003863905 -0.070326178  0.027615745  0.075418992  1.823642e-02
## 14               14. CUNY Bernard M Baruch College 0.2399130  1.348695    -0.003807438       -0.016592599      0.007542255              -0.038546347  0.0027660389      0.1211888       0.04571754     0.009227264  0.045479453  0.005518720  0.062514653 -1.095389e-03
## 15                           15. CUNY City College 0.2286066  1.314037     0.010126127       -0.014222492      0.011818418               0.017979735  0.0054844109      0.1211888       0.04571754    -0.018499901  0.035473252 -0.023836359  0.040986893 -3.609819e-03
## 16                       16. CUNY Brooklyn College 0.2281797  1.312745     0.013754144       -0.013128303      0.021899884              -0.037986686 -0.0003720923      0.1211888       0.04571754    -0.001532219  0.039944175 -0.009006110  0.045342506  2.358022e-03
## 17               17. St John's University-New York 0.2274490  1.310539     0.011219776        0.006049773      0.010115935              -0.033930189  0.0100270684      0.1211888       0.04571754     0.017525892  0.031495209 -0.002913229  0.010857113  9.523216e-05
## 18 18. Columbia University in the City of New York 0.2245557  1.301837    -0.008059933       -0.015046686     -0.230889587               0.070639611  0.0027660389      0.1211888       0.04571754     0.055788417  0.008729634  0.030949403  0.144479154 -1.706743e-03
## 19   19. CUNY John Jay College of Criminal Justice 0.2183074  1.283241     0.011251379       -0.011361383     -0.015543043               0.036841619 -0.0003720923      0.1211888       0.04571754    -0.014592864  0.041079485 -0.039889404  0.041086466  2.900901e-03
## 20                   20. Rutgers University-Newark 0.2127885  1.267037     0.010533622       -0.003568755      0.017435291              -0.004135841  0.0027660389      0.1211888       0.04571754     0.012868068 -0.002564624  0.005096916  0.009780675 -2.329315e-03
## # ... with 1,099 more rows
## Warning in cor(rank(x), rank(y)): the standard deviation is zero

## Warning in cor(rank(x), rank(y)): the standard deviation is zero

## $discipline.SocialSciences
## 
##  Spearman's rank correlation rho
## 
## data:  Utility[1:ntop] and y[1:ntop]
## S = 616, p-value = 0.01466
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5368421 
## 
## 
## $ethnicity.black
## 
##  Spearman's rank correlation rho
## 
## data:  Utility[1:ntop] and y[1:ntop]
## S = 2000, p-value = 0.02354
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5037594 
## 
## 
## $income.gt30Kle110K
## 
##  Spearman's rank correlation rho
## 
## data:  Utility[1:ntop] and y[1:ntop]
## S = 764, p-value = 0.06138
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.4255639 
## 
## 
## $gender.female
## 
##  Spearman's rank correlation rho
## 
## data:  Utility[1:ntop] and y[1:ntop]
## S = 989.72, p-value = 0.2763
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.2558509 
## 
## 
## $admission
## 
##  Spearman's rank correlation rho
## 
## data:  Utility[1:ntop] and y[1:ntop]
## S = 1362, p-value = 0.9198
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.02406015

How About a Road-Trip to Your Top-20 Schools?!

I’ve copied idea and code snips from Tad Dallas’s script on Kaggle.com:

mortarBoardIcon <- makeIcon(iconUrl = "https://upload.wikimedia.org/wikipedia/commons/thumb/7/76/Fxemoji_u1F393.svg/512px-Fxemoji_u1F393.svg.png", 
    iconWidth = 30, iconHeight = 30)

schools <- caseResult$topN %>% inner_join(student %>% dplyr::select(c(unitID, 
    College, city, state, zip, median_hh_inc_2005, lon, lat)), by = "unitID") %>% 
    mutate(info = paste0(College, "<br>", city, ", ", state, " ", zip, sprintf("<br> Rank: %d Utility: %.2f", 
        seq_along(Utility), Utility)))
map <- leaflet(schools) %>% setView(-93.65, 42.0285, zoom = 4) %>% addTiles() %>% 
    addMarkers(~lon, ~lat, popup = ~info, options = popupOptions(closeButton = TRUE), 
        clusterOptions = markerClusterOptions(), icon = mortarBoardIcon)
# print(map)
map

Sensitivity Analysis

Finally, we’ll do a brief set of sensitivity analyses illustrating how the college rankings change with the student’s demographics and behavioral traits.

We generate barplots of the utilities of the top 20 colleges for specific student profiles. (You may want to just execute print(sensResults$grobList) at the console after running the script just to see individual top-10 school barplots in a legible format.)

#=======================================
### SENSITIVITY ANALYSIS:
# Define a function for one-variable-at-a-time analysis of the student profile properties:
#if(FALSE){
# paropts <- par(no.readonly=TRUE)
sensitivity <- function(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt=Inf,plot.it=TRUE){
  #require(gridExtra)
  propnms <- names(pmap)
  cnt <- 0
  
  caseList <- list()
  grobList <- list()
  for(propnm in propnms){
    stdtProf <- stdtProf0
    # par(mfrow=c(2,ceiling(length(pmap[[propnm]])/2)))
    if(propnm == 'income') {
      if(stdtProf$dependent) {
        lvls <- names(pmap[[propnm]]$dependent)
      } else {
        lvls <- names(pmap[[propnm]]$independent)
      }
    } else {
      lvls <- names(pmap[[propnm]])
    }
    for(levelnm in lvls){
      stdtProf[[propnm]] <- levelnm
      stdtProf$beta <- getParameters(stdtProf,propertyMap)
      caseResult    <- studentCaseStudy(studentBF,stdtProf,propertyMap,ntop = 10,verbose = FALSE)
      
      profText <- c(stdtProf[propnm],stdtProf[!grepl(paste0('beta|traits|',propnm),names(stdtProf))])
      pltTitle <- paste(strwrap(paste(paste(names(profText),profText,sep=':'),collapse=', '),width = 80),collapse='\n')
      traitStr <- paste(paste(c('Risk','Vision','Breadth','Challenge'),stdtProf$traits,sep=":"),collapse=', ')
      pltTitle <- paste0("Top 10 Colleges for Profile:\n",pltTitle,'\n',traitStr)
      
      nms <- names(caseList)
      caseList <- setNames(c(caseList,list(caseResult)), c(nms,pltTitle))
      
      #if(plot.it){
      gplt0 <- caseResult$topN %>% 
        mutate(labels = factor(labels,levels=rev(levels(labels))),
               labelY = pmin(0,min(Utility))) %>%
        ggplot(aes(x=labels,y=Utility,fill=labels))
      gplt <-  gplt0 +
        geom_bar(stat='identity',position='dodge') + coord_flip() + 
        geom_text(aes(x=labels,y=labelY,label=labels),size=6,hjust=0,vjust=0.5) +
        theme(text=element_text(face='bold',size=8),
              axis.text.y=element_blank(),
              title=element_text(size=12,face='bold'),
              legend.position = "none") +
        ggtitle(pltTitle) +
        labs(x="Top 10 Colleges") 
      #gplt %>% print
      grobList <- c(grobList,list(gplt))
      #}
      
      cnt <- cnt + 1
      if(cnt>=maxcnt) break
    }
    if(cnt>=maxcnt) break
  }
  # par(paropts)
  
  ### UNFORTUNATELY, SEEMS TO LOSE CONTROL OF TEXT SIZE WHEN ggplots ARE LAID IN
  ### GRID USING PACKAGE 'gridExtra'.... HELP!!!
  #gmulti <- marrangeGrob(grobList,nrow=2,ncol=4)
  if(plot.it) do.call(grid.arrange,grobList)
  return(list(caseList=caseList,grobList=grobList))
}

Sensitivity Analysis #1:

Perform sensitivity analysis for a stellar female student interested in Engineering and a Large City setting, with a trait profile of Risk=High, Vision=Low, Breadth=High, Challenge=High. We’ll sweep over regions.

Note that because Breadth is High, she won’t stick to strictly engineering schools. And because Vision is Low, she stresses region and locale more strongly than future earnings.

stdtProf0            <- students$White
stdtProf0$gender     <- 'female'
stdtProf0$discipline <- 'Engineering'
stdtProf0$sat        <- 'gt1400'
stdtProf0$traits     <- StudentTrait$HrLvHbHc # Risk=High, Vision=Low, Breadth=High, Challenge=High

pmap            <- propertyMap[c('region','ethnicity')]
#pmap$ethnicity <- pmap$ethnicity[1:4]

maxcnt      <- length(pmap[[1]]) # Only flip ethnicity
sensResults <- sensitivity(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt = maxcnt, plot.it = FALSE)
print(sensResults$grobList)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

Sensitivity Analysis #2:

Perform sensitivity analysis for a stellar male student interested in Engineering and a Large City setting, with a trait profile of Risk=Low, Vision=Low, Breadth=High, Challenge=High. We’ll sweep over ethnicity; then for fixed ethnicity=White, we’ll sweep over region.

Note that because Risk is Low, he’ll strongly stress region & seek student populations that look a little more like himself.

And in the case of Region = New England, the student’s top-10 even has schools with negative utility!

stdtProf0            <- students$White
stdtProf0$gender     <- 'male'
stdtProf0$discipline <- 'Engineering'
stdtProf0$sat        <- 'gt1400'
stdtProf0$traits     <- StudentTrait$LrLvHbHc    # Risk=Low, Vision=Low, Breadth=High, Challenge=High

pmap           <- propertyMap[c('ethnicity','region')]
pmap$ethnicity <- pmap$ethnicity[1:4]
pmap$region    <- pmap$region[c('NewEngland','Southeast','GreatLakes','FarWest')]

maxcnt      <- Inf
sensResults <- sensitivity(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt = maxcnt, plot.it = FALSE)

print( sensResults$grobList )
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

Sensitivity Analysis #3:

Now, let’s see what happens when that same stellar male student shows up with a trait profile with more risk and has a longer-term vision:

stdtProf0            <- students$White
stdtProf0$gender     <- 'male'
stdtProf0$discipline <- 'Engineering'
stdtProf0$sat        <- 'gt1400'
stdtProf0$traits     <- StudentTrait$HrHvHbHc # Risk=High, Vision=High, Breadth=High, Challenge=High

pmap           <- propertyMap[c('ethnicity','region')]
pmap$ethnicity <- pmap$ethnicity[1:4]
pmap$region    <- pmap$region[c('NewEngland','Southeast','GreatLakes','FarWest')]

maxcnt      <- Inf
sensResults <- sensitivity(studentBF,stdtProf0,pmap,StudentTrait,propertyMap,maxcnt = maxcnt, plot.it = FALSE)
print(sensResults$grobList)
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

## 
## [[5]]

## 
## [[6]]

## 
## [[7]]

## 
## [[8]]

#}
#------------------------------------------------------------

Comments on Sensitivity Analyses

As you can see, the student’s value assessment, in terms of beta parameters (partworths), as determined by his or her behavioral traits and preferences strongly affects the top 10 high-utility schools for that student.

Experiment with different student profiles, such as lower SAT’s, and interest in alternative settings (locales other than Large City). The results are fascinating!

Even better, alter the beta parameters directly yourself in stdtProf0$beta….

Next Steps

Please, feel free to extend this model and tool. It’d be great to slap a quick Shiny interface on it so that a student can input his or her own demographics and criteria and quickly generate lists and charts detailing the student’s best college options.

Being posed in a probabilistic discrete choice modeling framework, this model represents lots of potential. It can, for example, be presented with individual student data – imagine the entire dataset of applicants to a university plus the indication of which applicants were accepted or not – and then, using tools such as the package rstan, we could estimate data-based model parameters beta suitable for identifying candidates for a specific college. This essentially reverse engineers the college’s admissions criteria.

It also offers a principled way of identifying which specific college is or is not a good choice for a student.

Even in its simplistic form shown here – which didn’t implement even some of the computed college properties – it’s quite entertaining to see how varied the rankings are given specific students – e.g., a top student (SAT>1400) with a strong regional preference makes quite interesting choices as the region of interest changes. Try it out!

Any feedback would be appreciated.

Thanks,

-Michael L. Thompson

(my info at LinkedIn)